unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
@ 2021-06-16 17:55 John Mellor-Crummey via Libc-alpha
  2021-06-17 19:42 ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: John Mellor-Crummey via Libc-alpha @ 2021-06-16 17:55 UTC (permalink / raw)
  To: libc-alpha; +Cc: John Mellor-Crummey

I was encouraged to notify this list about several LD_AUDIT bugs that
have significant impact on performance tools that we are developing
for Linux in general and US Department of Energy (DOE) parallel
supercomputers in particular.

My team develops the HPCToolkit performance tools
(https://hpctoolkit.org, https://github.com/HPCToolkit/hpctoolkit)
under funding from the DOE. We have been modifying our measurement
subsystem to interpose itself between an application and the OS using
glibc’s LD_AUDIT capability.

For such tools to succeed, we need many of LD_AUDIT’s features to
work. Over the last eight months, we identified six bugs on systems we
use. We believe that four are still problems in upstream glibc and two
may be fixed (as noted).

We would like confirmed fixes for the following bugs in upstream
glibc:

----------------------------------------------------------
Priority   | Issue 
—————————————————————————————
VERY       | When using an auditor, there is an unacceptable
HIGH       | performance degradation of over 10x for PLT
           | calls to small procedures even when neither
           | la_pltenter or la_pltexit is present.
----------------------------------------------------------
HIGH       | When auditing, a dlmopen of a shared library
           | causes a SEGV.
----------------------------------------------------------
HIGH       | la_symbind isn't always called when appropriate.
           | We observed that glibc 2.26 calls la_symbind
           | when appropriate; glibc 2.28 does not.
----------------------------------------------------------
HIGH       | glibc does not save all necessary registers
           | (e.g. X8 - the indirect result register, truncated
           | SIMD registers) when auditing on aarch64 since
           | the beginning of time.
----------------------------------------------------------
LOW        | When auditing, a dlopen of a shared library
           | that uses R_X86_64_TLSDESC causes a SEGV. This
           | is reportedly fixed in glibc 2.34.
----------------------------------------------------------
LOW   | An auditor added to an executable at link time
           | with --audit=auditor.so and noted in the DT_AUDIT
           | entry of the dynamic section is not called at
           | runtime. This is reportedly fixed in glibc 2.32.
----------------------------------------------------------



A repository of reproducers for these bugs can be found here:
https://github.com/hpctoolkit/auditor-tests.

A detailed writeup of everything known about each of these bugs,
including links to Red Hat and Sourceware Bugzilla entries, if any are
known to exist, can be found here:
https://docs.google.com/document/d/1dVaDBdzySecxQqD6hLLzDrEF18M1UtjDna9gL5BWWI0/edit?usp=sharing

Technical stakeholders for platforms that are HPCToolkit’s principal
targets under DOE funding, especially the exascale computing program:

----------------------------------------------------------
Stakeholder | Why
----------------------------------------------------------
Intel       | Prime contractor on Aurora exascale system at
            | Argonne National Laboratory
----------------------------------------------------------
IBM         | Prime contractor and processor vendor for Summit
            | and Sierra supercomputers at Oak Ridge National
            | Laboratory and Lawrence Livermore  National
            | Laboratory.  
----------------------------------------------------------
ARM         | Stakeholder who wants all ARM Linux platforms
            | to succeed, including Sandia National Laboratory's
            | Astra supercomputer and SUNY Stony Brook's
            | A64FX-based Ookami.
----------------------------------------------------------
AMD         | Processor vendor for Frontier and El Capitan
            | exascale supercomputers at Oak Ridge and Lawrence
            | Livermore National Laboratories.  
----------------------------------------------------------
SuSE        | Linux distribution provider for Cray systems to be
            | delivered to Oak Ridge and Lawrence Livermore
            | National Laboratories and the A64FX-based system
            | installed at SUNY Stony Brook.  
----------------------------------------------------------
Red Hat     | Linux distribution provider for Oak Ridge
            | National Laboratory s Summit, Lawrence Livermore
            | National Laboratory s Sierra.
----------------------------------------------------------
Cray        | Prime contractor and system vendor for Oak Ridge
            | and Lawrence Livermore National Laboratories,
            | and SUNY Stony Brook; system vendor for Argonne
            | National Laboratory.
----------------------------------------------------------


For reference, here is a pointer to the portion of our tool that uses
the LD_AUDIT interface:
https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/audit/auditor.c

Here are some of the capabilities of LD_AUDIT that we need to work and why:

-  We use LD_AUDIT’s la_objopen and la_objclose to track what objects are
   in an application’s address space so that our measurement subsystem
   can unwind the call stack when a profiling signal is
   received. Tracking libraries by wrapping dlopen is problematic for
   several reasons. For instance, a wrapper would need to implement RPATH
   and RUNPATH semantics because glibc does not provide an alternate
   dlopen interface (like _dlsym) so that a wrapper can provide the
   return address in the requesting library as an argument which glibc
   needs to determine the R_PATH and RUNPATH to use when trying to find
   the library and its dependencies.

-  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
   wrappers around key functions, e.g. pthread_create. This enables a
   tool to intercept functions invoked through pointers obtained with
   dlsym, which preloaded wrappers can’t do. (Note: We don’t use
   la_symbind for interposition yet, but we plan to when it works
   everywhere.)

-  We need auditing to work when an application or a tool library (e.g.,
   Intel’s GT-Pin) opens a shared library with dlmopen.

-  We need auditing to work when opening a dynamic library with TLS
   dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
   any special interest in such relocations; at present, they cause a
   SEGV when auditing and that must be avoided.

-  We want to add an auditor to an application at link time, noted in the
   DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
   program is launched enables our profiler to be injected into an
   application’s address space without a wrapper script that sets
   LD_AUDIT and LD_PRELOAD.

-  LD_AUDIT needs to work on aarch64, which is an important target for
   our tools. The fact that _dl_runtime_profile does not save register x8
   (the indirect result register) is often fatal for applications, which
   makes LD_AUDIT unusable for any purpose on aarch64.

-  LD_AUDIT needs to support auditing of inter-object calls on aarch64
   when SVE registers are in use.

As a final thing to consider: we understand that there is a tension
between security and auditability. We are concerned that changes being
considered for security may compromise observability for tools. For
tools, we would need a way to authorize full observability even in the
cases when that may theoretically reduce security. Perhaps setting
DT_AUDIT could be considered as authorizing full observability.

--
John Mellor-Crummey         Professor
Dept of Computer Science    Rice University
email: johnmc@rice.edu      phone: 713-348-5179


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-16 17:55 A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list) John Mellor-Crummey via Libc-alpha
@ 2021-06-17 19:42 ` Adhemerval Zanella via Libc-alpha
  2021-06-17 20:09   ` Florian Weimer via Libc-alpha
  2021-06-18 17:48   ` John Mellor-Crummey via Libc-alpha
  0 siblings, 2 replies; 23+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-06-17 19:42 UTC (permalink / raw)
  To: John Mellor-Crummey, libc-alpha; +Cc: Florian Weimer



On 16/06/2021 14:55, John Mellor-Crummey via Libc-alpha wrote:
> I was encouraged to notify this list about several LD_AUDIT bugs that
> have significant impact on performance tools that we are developing
> for Linux in general and US Department of Energy (DOE) parallel
> supercomputers in particular.
> 
> My team develops the HPCToolkit performance tools
> (https://hpctoolkit.org, https://github.com/HPCToolkit/hpctoolkit)
> under funding from the DOE. We have been modifying our measurement
> subsystem to interpose itself between an application and the OS using
> glibc’s LD_AUDIT capability.
> 
> For such tools to succeed, we need many of LD_AUDIT’s features to
> work. Over the last eight months, we identified six bugs on systems we
> use. We believe that four are still problems in upstream glibc and two
> may be fixed (as noted).
> 
> We would like confirmed fixes for the following bugs in upstream
> glibc:


Thank you very much to approach us about the current state of audit
interface from the user point of view.  There are some recent discussion
started by Ben Coyote Woodard some time ago that has stalled.

I checked the issues you listed below with master (I can't really
comment their status regarding the usual HPC distributions).

> 
> ----------------------------------------------------------
> Priority   | Issue 
> —————————————————————————————
> VERY       | When using an auditor, there is an unacceptable
> HIGH       | performance degradation of over 10x for PLT
>            | calls to small procedures even when neither
>            | la_pltenter or la_pltexit is present.

The fix provided by https://sourceware.org/legacy-ml/libc-alpha/2013-05/msg00888.html
seems to fix, I am not sure why it has not been applied.  It also seems to not
show any regressions.

I will respin the patch and check if we can add a testcase.


> ----------------------------------------------------------
> HIGH       | When auditing, a dlmopen of a shared library
>            | causes a SEGV.

This does not show on master, I will need to pinpoint exactly which are the
commits that actually fixed it.

> ----------------------------------------------------------
> HIGH       | la_symbind isn't always called when appropriate.
>            | We observed that glibc 2.26 calls la_symbind
>            | when appropriate; glibc 2.28 does not.

I am not sure how to proper fix it, maybe an option is just to disable bind-now
if we have any audit module enabled. We will need to discuss the possible
implications.

(btw your auditor-tests/symbind does not enable bind-now with either
LD_BIND_NOW=1 or -Wl,-z,now).

> ----------------------------------------------------------
> HIGH       | glibc does not save all necessary registers
>            | (e.g. X8 - the indirect result register, truncated
>            | SIMD registers) when auditing on aarch64 since
>            | the beginning of time.

This has been already discussed on the maillist:

* NEON support: we have a proposed [1] patch that addresses it by extending the 
  export struct used. Although the patch seems to fix the issue described by 
  BZ#26643 it is still incomplete: it would required to bump LAV_CURRENT for
  aarch64 (and add a arch-specific way to override it), add tests, and check 
  on how to handle possible incompatbilities. From a briefly chat with other
  glibc maintainer, we can just set that audit version lower LAV_CURRENT are
  just unsupported.
  
* SVE support: as indicated by Szabolcs SVE calls are marked with the 
  STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
  To support SVE, it would require save/restore all registers (but pass down 
  arg and retval registers to pltentry/exit callbacks according to the base 
  PCS). Another option would be to use different LAV_CURRENT version, one for
  NEON and SVE with 128-bits and another for SVE larger than 256-bits.
  This also has performance implications.
  
[1] https://patchwork.sourceware.org/project/glibc/patch/20200923160448.2321909-1-woodard@redhat.com/
[2] https://sourceware.org/pipermail/libc-alpha/2020-September/117822.html

> ----------------------------------------------------------
> LOW        | When auditing, a dlopen of a shared library
>            | that uses R_X86_64_TLSDESC causes a SEGV. This
>            | is reportedly fixed in glibc 2.34.

It should be fixed by BZ#27137 (8f7e09f4dbdb5c815a18b8285fbc5d5d7bc17d86),
however it has regressed by 03e187a41d9.  We need the following fix and we
definitely need a regression tests for this (I will probably use your
auditor-test) as base:

diff --git a/elf/dl-open.c b/elf/dl-open.c
index d2240d8747..d2638ebf05 100644
--- a/elf/dl-open.c
+++ b/elf/dl-open.c
@@ -771,7 +771,7 @@ dl_open_worker (void *a)
     {
       struct link_map *libc_map = GL(dl_ns)[args->nsid].libc_map;
 #ifdef SHARED
-      bool initial = libc_map->l_ns == LM_ID_BASE;
+      bool initial = libc_map != NULL ? libc_map->l_ns == LM_ID_BASE : false;
 #else
       /* In the static case, there is only one namespace, but it
         contains a secondary libc (the primary libc is statically


> ----------------------------------------------------------
> LOW   | An auditor added to an executable at link time
>            | with --audit=auditor.so and noted in the DT_AUDIT
>            | entry of the dynamic section is not called at
>            | runtime. This is reportedly fixed in glibc 2.32.
> ----------------------------------------------------------
> 

This has also fixed by BZ#24943 on glibc 2.32

> 
> 
> A repository of reproducers for these bugs can be found here:
> https://github.com/hpctoolkit/auditor-tests.

Thanks, this is really helpful.

> 
> A detailed writeup of everything known about each of these bugs,
> including links to Red Hat and Sourceware Bugzilla entries, if any are
> known to exist, can be found here:
> https://docs.google.com/document/d/1dVaDBdzySecxQqD6hLLzDrEF18M1UtjDna9gL5BWWI0/edit?usp=sharing
> 

And you are correct about your assessment on the document, we *really* need
more extensible tests for audit interface.  It would be really helpful if
we could add more tests to exercise the real world usage from tools that
rely on audit modules.

> Technical stakeholders for platforms that are HPCToolkit’s principal
> targets under DOE funding, especially the exascale computing program:
> 
> ----------------------------------------------------------
> Stakeholder | Why
> ----------------------------------------------------------
> Intel       | Prime contractor on Aurora exascale system at
>             | Argonne National Laboratory
> ----------------------------------------------------------
> IBM         | Prime contractor and processor vendor for Summit
>             | and Sierra supercomputers at Oak Ridge National
>             | Laboratory and Lawrence Livermore  National
>             | Laboratory.  
> ----------------------------------------------------------
> ARM         | Stakeholder who wants all ARM Linux platforms
>             | to succeed, including Sandia National Laboratory's
>             | Astra supercomputer and SUNY Stony Brook's
>             | A64FX-based Ookami.
> ----------------------------------------------------------
> AMD         | Processor vendor for Frontier and El Capitan
>             | exascale supercomputers at Oak Ridge and Lawrence
>             | Livermore National Laboratories.  
> ----------------------------------------------------------
> SuSE        | Linux distribution provider for Cray systems to be
>             | delivered to Oak Ridge and Lawrence Livermore
>             | National Laboratories and the A64FX-based system
>             | installed at SUNY Stony Brook.  
> ----------------------------------------------------------
> Red Hat     | Linux distribution provider for Oak Ridge
>             | National Laboratory s Summit, Lawrence Livermore
>             | National Laboratory s Sierra.
> ----------------------------------------------------------
> Cray        | Prime contractor and system vendor for Oak Ridge
>             | and Lawrence Livermore National Laboratories,
>             | and SUNY Stony Brook; system vendor for Argonne
>             | National Laboratory.
> ----------------------------------------------------------
> 
> 
> For reference, here is a pointer to the portion of our tool that uses
> the LD_AUDIT interface:
> https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/audit/auditor.c
> 
> Here are some of the capabilities of LD_AUDIT that we need to work and why:
> 
> -  We use LD_AUDIT’s la_objopen and la_objclose to track what objects are
>    in an application’s address space so that our measurement subsystem
>    can unwind the call stack when a profiling signal is
>    received. Tracking libraries by wrapping dlopen is problematic for
>    several reasons. For instance, a wrapper would need to implement RPATH
>    and RUNPATH semantics because glibc does not provide an alternate
>    dlopen interface (like _dlsym) so that a wrapper can provide the
>    return address in the requesting library as an argument which glibc
>    needs to determine the R_PATH and RUNPATH to use when trying to find
>    the library and its dependencies.

It is not clear to me  what kind of extensible interface for audit you are 
asking here or if la_objopen/la_objclose is already suffice.

> 
> -  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
>    wrappers around key functions, e.g. pthread_create. This enables a
>    tool to intercept functions invoked through pointers obtained with
>    dlsym, which preloaded wrappers can’t do. (Note: We don’t use
>    la_symbind for interposition yet, but we plan to when it works
>    everywhere.)
> 
> -  We need auditing to work when an application or a tool library (e.g.,
>    Intel’s GT-Pin) opens a shared library with dlmopen.
> 
> -  We need auditing to work when opening a dynamic library with TLS
>    dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
>    any special interest in such relocations; at present, they cause a
>    SEGV when auditing and that must be avoided.

This should be fixed minus the regression above.

> 
> -  We want to add an auditor to an application at link time, noted in the
>    DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
>    program is launched enables our profiler to be injected into an
>    application’s address space without a wrapper script that sets
>    LD_AUDIT and LD_PRELOAD.

So if I understood correctly you are asking something like a DT_FILTER
for DT_AUDIT?

> 
> -  LD_AUDIT needs to work on aarch64, which is an important target for
>    our tools. The fact that _dl_runtime_profile does not save register x8
>    (the indirect result register) is often fatal for applications, which
>    makes LD_AUDIT unusable for any purpose on aarch64.
> 
> -  LD_AUDIT needs to support auditing of inter-object calls on aarch64
>    when SVE registers are in use.
> 
> As a final thing to consider: we understand that there is a tension
> between security and auditability. We are concerned that changes being
> considered for security may compromise observability for tools. For
> tools, we would need a way to authorize full observability even in the
> cases when that may theoretically reduce security. Perhaps setting
> DT_AUDIT could be considered as authorizing full observability.
> 
> --
> John Mellor-Crummey         Professor
> Dept of Computer Science    Rice University
> email: johnmc@rice.edu      phone: 713-348-5179
> 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-17 19:42 ` Adhemerval Zanella via Libc-alpha
@ 2021-06-17 20:09   ` Florian Weimer via Libc-alpha
  2021-06-17 23:06     ` Adhemerval Zanella via Libc-alpha
  2021-06-21 19:42     ` John Mellor-Crummey via Libc-alpha
  2021-06-18 17:48   ` John Mellor-Crummey via Libc-alpha
  1 sibling, 2 replies; 23+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-06-17 20:09 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: John Mellor-Crummey, libc-alpha

* Adhemerval Zanella:

> I checked the issues you listed below with master (I can't really
> comment their status regarding the usual HPC distributions).

Thank you for reviewing the status quo.

>> ----------------------------------------------------------
>> HIGH       | la_symbind isn't always called when appropriate.
>>            | We observed that glibc 2.26 calls la_symbind
>>            | when appropriate; glibc 2.28 does not.
>
> I am not sure how to proper fix it, maybe an option is just to disable bind-now
> if we have any audit module enabled. We will need to discuss the possible
> implications.
>
> (btw your auditor-tests/symbind does not enable bind-now with either
> LD_BIND_NOW=1 or -Wl,-z,now).

The issue is that the la_symbind interface is not very good at
communicating that PLT enter/exit hooks aren't available under these
circumstances.  Maybe we can provide another flag and signal an error if
la_symbind requests enter/exit hooks in BIND_NOW mode.

>> ----------------------------------------------------------
>> HIGH       | glibc does not save all necessary registers
>>            | (e.g. X8 - the indirect result register, truncated
>>            | SIMD registers) when auditing on aarch64 since
>>            | the beginning of time.
>
> This has been already discussed on the maillist:
>
> * NEON support: we have a proposed [1] patch that addresses it by extending the 
>   export struct used. Although the patch seems to fix the issue described by 
>   BZ#26643 it is still incomplete: it would required to bump LAV_CURRENT for
>   aarch64 (and add a arch-specific way to override it), add tests, and check 
>   on how to handle possible incompatbilities. From a briefly chat with other
>   glibc maintainer, we can just set that audit version lower LAV_CURRENT are
>   just unsupported.
>   
> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>   STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>   To support SVE, it would require save/restore all registers (but pass down 
>   arg and retval registers to pltentry/exit callbacks according to the base 
>   PCS). Another option would be to use different LAV_CURRENT version, one for
>   NEON and SVE with 128-bits and another for SVE larger than 256-bits.
>   This also has performance implications.

SVE support overlaps with the la_symbind issue quoted above because
those bindings are conceptually BIND_NOW.
  
> [1] https://patchwork.sourceware.org/project/glibc/patch/20200923160448.2321909-1-woodard@redhat.com/
> [2] https://sourceware.org/pipermail/libc-alpha/2020-September/117822.html
>
>> ----------------------------------------------------------
>> LOW        | When auditing, a dlopen of a shared library
>>            | that uses R_X86_64_TLSDESC causes a SEGV. This
>>            | is reportedly fixed in glibc 2.34.
>
> It should be fixed by BZ#27137 (8f7e09f4dbdb5c815a18b8285fbc5d5d7bc17d86),
> however it has regressed by 03e187a41d9.  We need the following fix and we
> definitely need a regression tests for this (I will probably use your
> auditor-test) as base:
>
> diff --git a/elf/dl-open.c b/elf/dl-open.c
> index d2240d8747..d2638ebf05 100644
> --- a/elf/dl-open.c
> +++ b/elf/dl-open.c
> @@ -771,7 +771,7 @@ dl_open_worker (void *a)
>      {
>        struct link_map *libc_map = GL(dl_ns)[args->nsid].libc_map;
>  #ifdef SHARED
> -      bool initial = libc_map->l_ns == LM_ID_BASE;
> +      bool initial = libc_map != NULL ? libc_map->l_ns == LM_ID_BASE : false;
>  #else
>        /* In the static case, there is only one namespace, but it
>          contains a secondary libc (the primary libc is statically

Hmm, I'm surprised that this doesn't crash more widely.  Is it because
DT_NEEDED on ld.so preceeds libc.so if __tls_get_addr is used, or
something like that?

> And you are correct about your assessment on the document, we *really* need
> more extensible tests for audit interface.  It would be really helpful if
> we could add more tests to exercise the real world usage from tools that
> rely on audit modules.

Agreed.  It would also be helpful to verify that the callbacks happened
in the expected order.  This is always a troublesome aspect of a
callback-based interface.

>> -  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
>>    wrappers around key functions, e.g. pthread_create. This enables a
>>    tool to intercept functions invoked through pointers obtained with
>>    dlsym, which preloaded wrappers can’t do. (Note: We don’t use
>>    la_symbind for interposition yet, but we plan to when it works
>>    everywhere.)
>> 
>> -  We need auditing to work when an application or a tool library (e.g.,
>>    Intel’s GT-Pin) opens a shared library with dlmopen.
>> 
>> -  We need auditing to work when opening a dynamic library with TLS
>>    dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
>>    any special interest in such relocations; at present, they cause a
>>    SEGV when auditing and that must be avoided.
>
> This should be fixed minus the regression above.

pthread_create interception becomes more difficult in glibc 2.34 because
the pthread_create symbol is no longer interposable.

We could make pthread_create interposable in the same we way do that
today for malloc.

pthread_create interposition on glibc 2.0 architectures needs version
information in la_symbind.  No 64-bit architecture except Alpha is that
old, so maybe this doesn't matter today.  (libstdc++ currently binds to
the pthread_create compatibility symbol on these architectures, so the
issue is quite visible, but a rebuild of GCC against glibc 2.34 will fix
that, too.  Then the default version will be used.)

One caveat: We should be able to make la_symbind work on current
distributions with their hardened build defaults, *however* PLT
enter/exit hooks will not work.  For the time being, you would have to
find a way to wrap the call yourself.  This is theoretically fixable
even without run-time code generation (Madhavan T. Venkataraman from
Microsoft has submitted patches in this area for libffi), but it's
entirely vaporware at this point.  This looks like a fairly isolated
project someone could work on without spending considerable time on
learning the internals of the dynamic loader.

>> -  We want to add an auditor to an application at link time, noted in the
>>    DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
>>    program is launched enables our profiler to be injected into an
>>    application’s address space without a wrapper script that sets
>>    LD_AUDIT and LD_PRELOAD.
>
> So if I understood correctly you are asking something like a DT_FILTER
> for DT_AUDIT?

Indeed, please let us know if the existing DT_AUDIT support does not
work for you.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-17 20:09   ` Florian Weimer via Libc-alpha
@ 2021-06-17 23:06     ` Adhemerval Zanella via Libc-alpha
  2021-06-23 17:42       ` Ben Woodard via Libc-alpha
  2021-06-21 19:42     ` John Mellor-Crummey via Libc-alpha
  1 sibling, 1 reply; 23+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-06-17 23:06 UTC (permalink / raw)
  To: Florian Weimer; +Cc: John Mellor-Crummey, libc-alpha



On 17/06/2021 17:09, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> I checked the issues you listed below with master (I can't really
>> comment their status regarding the usual HPC distributions).
> 
> Thank you for reviewing the status quo.
> 
>>> ----------------------------------------------------------
>>> HIGH       | la_symbind isn't always called when appropriate.
>>>            | We observed that glibc 2.26 calls la_symbind
>>>            | when appropriate; glibc 2.28 does not.
>>
>> I am not sure how to proper fix it, maybe an option is just to disable bind-now
>> if we have any audit module enabled. We will need to discuss the possible
>> implications.
>>
>> (btw your auditor-tests/symbind does not enable bind-now with either
>> LD_BIND_NOW=1 or -Wl,-z,now).
> 
> The issue is that the la_symbind interface is not very good at
> communicating that PLT enter/exit hooks aren't available under these
> circumstances.  Maybe we can provide another flag and signal an error if
> la_symbind requests enter/exit hooks in BIND_NOW mode.

Another option is to provide such information on audit interface itself
through an extension, maybe by adding an extra flag for la_activity()
(LA_ACT_BINDNOW or likewise) or on la_symbind* itself.

In both cases it will most likely require to bump LAV_CURRENT.

> 
>>> ----------------------------------------------------------
>>> HIGH       | glibc does not save all necessary registers
>>>            | (e.g. X8 - the indirect result register, truncated
>>>            | SIMD registers) when auditing on aarch64 since
>>>            | the beginning of time.
>>
>> This has been already discussed on the maillist:
>>
>> * NEON support: we have a proposed [1] patch that addresses it by extending the 
>>   export struct used. Although the patch seems to fix the issue described by 
>>   BZ#26643 it is still incomplete: it would required to bump LAV_CURRENT for
>>   aarch64 (and add a arch-specific way to override it), add tests, and check 
>>   on how to handle possible incompatbilities. From a briefly chat with other
>>   glibc maintainer, we can just set that audit version lower LAV_CURRENT are
>>   just unsupported.
>>   
>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>   STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>>   To support SVE, it would require save/restore all registers (but pass down 
>>   arg and retval registers to pltentry/exit callbacks according to the base 
>>   PCS). Another option would be to use different LAV_CURRENT version, one for
>>   NEON and SVE with 128-bits and another for SVE larger than 256-bits.
>>   This also has performance implications.
> 
> SVE support overlaps with the la_symbind issue quoted above because
> those bindings are conceptually BIND_NOW.

There is additional issues where we need to define whether we will use
a hacky solution to work around the ABI limitation or if we can design
something between.  And it also has the performance implications, it
will be hard not to notice potentially multiple KBs of load/store on
each PLT call.

>   
>> [1] https://patchwork.sourceware.org/project/glibc/patch/20200923160448.2321909-1-woodard@redhat.com/
>> [2] https://sourceware.org/pipermail/libc-alpha/2020-September/117822.html
>>
>>> ----------------------------------------------------------
>>> LOW        | When auditing, a dlopen of a shared library
>>>            | that uses R_X86_64_TLSDESC causes a SEGV. This
>>>            | is reportedly fixed in glibc 2.34.
>>
>> It should be fixed by BZ#27137 (8f7e09f4dbdb5c815a18b8285fbc5d5d7bc17d86),
>> however it has regressed by 03e187a41d9.  We need the following fix and we
>> definitely need a regression tests for this (I will probably use your
>> auditor-test) as base:
>>
>> diff --git a/elf/dl-open.c b/elf/dl-open.c
>> index d2240d8747..d2638ebf05 100644
>> --- a/elf/dl-open.c
>> +++ b/elf/dl-open.c
>> @@ -771,7 +771,7 @@ dl_open_worker (void *a)
>>      {
>>        struct link_map *libc_map = GL(dl_ns)[args->nsid].libc_map;
>>  #ifdef SHARED
>> -      bool initial = libc_map->l_ns == LM_ID_BASE;
>> +      bool initial = libc_map != NULL ? libc_map->l_ns == LM_ID_BASE : false;
>>  #else
>>        /* In the static case, there is only one namespace, but it
>>          contains a secondary libc (the primary libc is statically
> 
> Hmm, I'm surprised that this doesn't crash more widely.  Is it because
> DT_NEEDED on ld.so preceeds libc.so if __tls_get_addr is used, or
> something like that?

I will need to check why we haven't see it with our own testcase.
I in case I will prepare a patch to fix it.

> 
>> And you are correct about your assessment on the document, we *really* need
>> more extensible tests for audit interface.  It would be really helpful if
>> we could add more tests to exercise the real world usage from tools that
>> rely on audit modules.
> 
> Agreed.  It would also be helpful to verify that the callbacks happened
> in the expected order.  This is always a troublesome aspect of a
> callback-based interface.
> 
>>> -  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
>>>    wrappers around key functions, e.g. pthread_create. This enables a
>>>    tool to intercept functions invoked through pointers obtained with
>>>    dlsym, which preloaded wrappers can’t do. (Note: We don’t use
>>>    la_symbind for interposition yet, but we plan to when it works
>>>    everywhere.)
>>>
>>> -  We need auditing to work when an application or a tool library (e.g.,
>>>    Intel’s GT-Pin) opens a shared library with dlmopen.
>>>
>>> -  We need auditing to work when opening a dynamic library with TLS
>>>    dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
>>>    any special interest in such relocations; at present, they cause a
>>>    SEGV when auditing and that must be avoided.
>>
>> This should be fixed minus the regression above.
> 
> pthread_create interception becomes more difficult in glibc 2.34 because
> the pthread_create symbol is no longer interposable.
> 
> We could make pthread_create interposable in the same we way do that
> today for malloc.

But it is only a handfull call internall (POSIX timer for instance).
I don't think

> 
> pthread_create interposition on glibc 2.0 architectures needs version
> information in la_symbind.  No 64-bit architecture except Alpha is that
> old, so maybe this doesn't matter today.  (libstdc++ currently binds to
> the pthread_create compatibility symbol on these architectures, so the
> issue is quite visible, but a rebuild of GCC against glibc 2.34 will fix
> that, too.  Then the default version will be used.)
> 
> One caveat: We should be able to make la_symbind work on current
> distributions with their hardened build defaults, *however* PLT
> enter/exit hooks will not work.  For the time being, you would have to
> find a way to wrap the call yourself.  This is theoretically fixable
> even without run-time code generation (Madhavan T. Venkataraman from
> Microsoft has submitted patches in this area for libffi), but it's
> entirely vaporware at this point.  This looks like a fairly isolated
> project someone could work on without spending considerable time on
> learning the internals of the dynamic loader.

Is this what your are referring [1]? It is not clear to me how the v2 of this
does not require a trip to the kernel, does it use a vDSO to place the 
trampolines (also the libffi.v2.txt link is broken)?

In any case, do you think it should be fixable by adding trampolines on
extra mmap memory? 

[1] https://lkml.org/lkml/2020/9/16/643

> 
>>> -  We want to add an auditor to an application at link time, noted in the
>>>    DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
>>>    program is launched enables our profiler to be injected into an
>>>    application’s address space without a wrapper script that sets
>>>    LD_AUDIT and LD_PRELOAD.
>>
>> So if I understood correctly you are asking something like a DT_FILTER
>> for DT_AUDIT?
> 
> Indeed, please let us know if the existing DT_AUDIT support does not
> work for you.
> 
> Thanks,
> Florian
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-17 19:42 ` Adhemerval Zanella via Libc-alpha
  2021-06-17 20:09   ` Florian Weimer via Libc-alpha
@ 2021-06-18 17:48   ` John Mellor-Crummey via Libc-alpha
  2021-06-18 18:27     ` Adhemerval Zanella via Libc-alpha
  1 sibling, 1 reply; 23+ messages in thread
From: John Mellor-Crummey via Libc-alpha @ 2021-06-18 17:48 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha

Adhemerval,

Thanks for your detailed response to the LD_AUDIT issues I raised.
My responses to your questions are interspersed below.

John

> On Jun 17, 2021, at 2:42 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
> 
> 
> On 16/06/2021 14:55, John Mellor-Crummey via Libc-alpha wrote:
>> I was encouraged to notify this list about several LD_AUDIT bugs that
>> have significant impact on performance tools that we are developing
>> for Linux in general and US Department of Energy (DOE) parallel
>> supercomputers in particular.
>> 
>> My team develops the HPCToolkit performance tools
>> (https://hpctoolkit.org, https://github.com/HPCToolkit/hpctoolkit)
>> under funding from the DOE. We have been modifying our measurement
>> subsystem to interpose itself between an application and the OS using
>> glibc’s LD_AUDIT capability.
>> 
>> For such tools to succeed, we need many of LD_AUDIT’s features to
>> work. Over the last eight months, we identified six bugs on systems we
>> use. We believe that four are still problems in upstream glibc and two
>> may be fixed (as noted).
>> 
>> We would like confirmed fixes for the following bugs in upstream
>> glibc:
> 
> 
> Thank you very much to approach us about the current state of audit
> interface from the user point of view.  There are some recent discussion
> started by Ben Coyote Woodard some time ago that has stalled.
> 
> I checked the issues you listed below with master (I can't really
> comment their status regarding the usual HPC distributions).
> 
>> 
>> ----------------------------------------------------------
>> Priority   | Issue 
>> —————————————————————————————
>> VERY       | When using an auditor, there is an unacceptable
>> HIGH       | performance degradation of over 10x for PLT
>>           | calls to small procedures even when neither
>>           | la_pltenter or la_pltexit is present.
> 
> The fix provided by https://sourceware.org/legacy-ml/libc-alpha/2013-05/msg00888.html
> seems to fix, I am not sure why it has not been applied.  It also seems to not
> show any regressions.
> 
> I will respin the patch and check if we can add a testcase.

We would be delighted to have this problem resolved!

Trying to avoid the overhead of the profiling resolver in our tool (overriding
the library state set up by glibc to use the standard resolver, ensuring the GOT table 
is writable, etc.) is a complex hack. While we’ve done it right in most cases, there are 
edge cases where we still go through the profiling resolver when monitoring Firefox. 

Having glibc record the binding in the GOT as per usual for PLT calls is best.

> 
> 
>> ----------------------------------------------------------
>> HIGH       | When auditing, a dlmopen of a shared library
>>           | causes a SEGV.
> 
> This does not show on master, I will need to pinpoint exactly which are the
> commits that actually fixed it.
> 
>> ----------------------------------------------------------
>> HIGH       | la_symbind isn't always called when appropriate.
>>           | We observed that glibc 2.26 calls la_symbind
>>           | when appropriate; glibc 2.28 does not.
> 
> I am not sure how to proper fix it, maybe an option is just to disable bind-now
> if we have any audit module enabled. We will need to discuss the possible
> implications.
> 
> (btw your auditor-tests/symbind does not enable bind-now with either
> LD_BIND_NOW=1 or -Wl,-z,now).

Our test was expecting normal lazy binding. 

I would argue that there should be la_symbind callbacks even in the
LD_BIND_NOW case. For us to use la_symbind to replace an application’s use
of a routine with the use of a wrapped routine that informs our tool as needed
(e.g. a call to pthread_create, fork, MPI_Init, etc.), we need the la_symbind callbacks
even if when LD_BIND_NOW is specified for a library.

> 
>> ----------------------------------------------------------
>> HIGH       | glibc does not save all necessary registers
>>           | (e.g. X8 - the indirect result register, truncated
>>           | SIMD registers) when auditing on aarch64 since
>>           | the beginning of time.
> 
> This has been already discussed on the maillist:

I am not surprised that this has been mentioned before. We originally reported 
the problem to Ben Woodard back in September and I know from the Bugzilla records
(and from Ben) that there has been a lot of activity exploring the implications since.

In hindsight, I think our tool may not need this problem to be resolved as long as the
top issue (avoiding profiling PLT calls when it is not wanted) is resolved.
Since we have no interest in profiling PLT calls, I think that having glibc avoid the 
unwanted calls to _dl_runtime_profile by using the regular resolver instead would fix 
our issue since the broken code is in _dl_runtime_profile.

> * NEON support: we have a proposed [1] patch that addresses it by extending the 
>  export struct used. Although the patch seems to fix the issue described by 
>  BZ#26643 it is still incomplete: it would required to bump LAV_CURRENT for
>  aarch64 (and add a arch-specific way to override it), add tests, and check 
>  on how to handle possible incompatbilities. From a briefly chat with other
>  glibc maintainer, we can just set that audit version lower LAV_CURRENT are
>  just unsupported.
> 
> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>  STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>  To support SVE, it would require save/restore all registers (but pass down 
>  arg and retval registers to pltentry/exit callbacks according to the base 
>  PCS). Another option would be to use different LAV_CURRENT version, one for
>  NEON and SVE with 128-bits and another for SVE larger than 256-bits.
>  This also has performance implications.
> 
> [1] https://patchwork.sourceware.org/project/glibc/patch/20200923160448.2321909-1-woodard@redhat.com/
> [2] https://sourceware.org/pipermail/libc-alpha/2020-September/117822.html
> 
>> ----------------------------------------------------------
>> LOW        | When auditing, a dlopen of a shared library
>>           | that uses R_X86_64_TLSDESC causes a SEGV. This
>>           | is reportedly fixed in glibc 2.34.
> 
> It should be fixed by BZ#27137 (8f7e09f4dbdb5c815a18b8285fbc5d5d7bc17d86),
> however it has regressed by 03e187a41d9.  We need the following fix and we
> definitely need a regression tests for this (I will probably use your
> auditor-test) as base:
> 
> diff --git a/elf/dl-open.c b/elf/dl-open.c
> index d2240d8747..d2638ebf05 100644
> --- a/elf/dl-open.c
> +++ b/elf/dl-open.c
> @@ -771,7 +771,7 @@ dl_open_worker (void *a)
>     {
>       struct link_map *libc_map = GL(dl_ns)[args->nsid].libc_map;
> #ifdef SHARED
> -      bool initial = libc_map->l_ns == LM_ID_BASE;
> +      bool initial = libc_map != NULL ? libc_map->l_ns == LM_ID_BASE : false;
> #else
>       /* In the static case, there is only one namespace, but it
>         contains a secondary libc (the primary libc is statically

A fix would be welcome. I don’t understand the glibc issues in detail for this bug
and I’m glad to have help from others that do!

> 
> 
>> ----------------------------------------------------------
>> LOW   | An auditor added to an executable at link time
>>           | with --audit=auditor.so and noted in the DT_AUDIT
>>           | entry of the dynamic section is not called at
>>           | runtime. This is reportedly fixed in glibc 2.32.
>> ----------------------------------------------------------
>> 
> 
> This has also fixed by BZ#24943 on glibc 2.32

Great!

> 
>> 
>> 
>> A repository of reproducers for these bugs can be found here:
>> https://github.com/hpctoolkit/auditor-tests.
> 
> Thanks, this is really helpful.
> 
>> 
>> A detailed writeup of everything known about each of these bugs,
>> including links to Red Hat and Sourceware Bugzilla entries, if any are
>> known to exist, can be found here:
>> https://docs.google.com/document/d/1dVaDBdzySecxQqD6hLLzDrEF18M1UtjDna9gL5BWWI0/edit?usp=sharing
>> 
> 
> And you are correct about your assessment on the document, we *really* need
> more extensible tests for audit interface.  It would be really helpful if
> we could add more tests to exercise the real world usage from tools that
> rely on audit modules.

I know that the things in our test repository aren’t ideal for automated regression
tests since they typically involve looking at the printed output or noticing a SEGV.
At the outset, we were most interested in using them to document the bugs. 
We can work with Ben Woodard to create proper auditor tests.

> 
>> Technical stakeholders for platforms that are HPCToolkit’s principal
>> targets under DOE funding, especially the exascale computing program:
>> 
>> ----------------------------------------------------------
>> Stakeholder | Why
>> ----------------------------------------------------------
>> Intel       | Prime contractor on Aurora exascale system at
>>            | Argonne National Laboratory
>> ----------------------------------------------------------
>> IBM         | Prime contractor and processor vendor for Summit
>>            | and Sierra supercomputers at Oak Ridge National
>>            | Laboratory and Lawrence Livermore  National
>>            | Laboratory.  
>> ----------------------------------------------------------
>> ARM         | Stakeholder who wants all ARM Linux platforms
>>            | to succeed, including Sandia National Laboratory's
>>            | Astra supercomputer and SUNY Stony Brook's
>>            | A64FX-based Ookami.
>> ----------------------------------------------------------
>> AMD         | Processor vendor for Frontier and El Capitan
>>            | exascale supercomputers at Oak Ridge and Lawrence
>>            | Livermore National Laboratories.  
>> ----------------------------------------------------------
>> SuSE        | Linux distribution provider for Cray systems to be
>>            | delivered to Oak Ridge and Lawrence Livermore
>>            | National Laboratories and the A64FX-based system
>>            | installed at SUNY Stony Brook.  
>> ----------------------------------------------------------
>> Red Hat     | Linux distribution provider for Oak Ridge
>>            | National Laboratory s Summit, Lawrence Livermore
>>            | National Laboratory s Sierra.
>> ----------------------------------------------------------
>> Cray        | Prime contractor and system vendor for Oak Ridge
>>            | and Lawrence Livermore National Laboratories,
>>            | and SUNY Stony Brook; system vendor for Argonne
>>            | National Laboratory.
>> ----------------------------------------------------------
>> 
>> 
>> For reference, here is a pointer to the portion of our tool that uses
>> the LD_AUDIT interface:
>> https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/audit/auditor.c
>> 
>> Here are some of the capabilities of LD_AUDIT that we need to work and why:
>> 
>> -  We use LD_AUDIT’s la_objopen and la_objclose to track what objects are
>>   in an application’s address space so that our measurement subsystem
>>   can unwind the call stack when a profiling signal is
>>   received. Tracking libraries by wrapping dlopen is problematic for
>>   several reasons. For instance, a wrapper would need to implement RPATH
>>   and RUNPATH semantics because glibc does not provide an alternate
>>   dlopen interface (like _dlsym) so that a wrapper can provide the
>>   return address in the requesting library as an argument which glibc
>>   needs to determine the R_PATH and RUNPATH to use when trying to find
>>   the library and its dependencies.
> 
> It is not clear to me  what kind of extensible interface for audit you are 
> asking here or if la_objopen/la_objclose is already suffice.

The existing la_objopen/la_objclose suffice, provided there is not a blocker that 
prevents us from using LD_AUDIT. 

However, it is worth noting that there is really no other reasonable option 
(e.g. a simple wrapper for dlopen) since there is not an alternate interface 
for dlopen where one can provide the return address (which dlopen uses to 
determine the applicable RPATH or RUNPATH) as one can with _dlsym.

>> -  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
>>   wrappers around key functions, e.g. pthread_create. This enables a
>>   tool to intercept functions invoked through pointers obtained with
>>   dlsym, which preloaded wrappers can’t do. (Note: We don’t use
>>   la_symbind for interposition yet, but we plan to when it works
>>   everywhere.)
>> 
>> -  We need auditing to work when an application or a tool library (e.g.,
>>   Intel’s GT-Pin) opens a shared library with dlmopen.
>> 
>> -  We need auditing to work when opening a dynamic library with TLS
>>   dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
>>   any special interest in such relocations; at present, they cause a
>>   SEGV when auditing and that must be avoided.
> 
> This should be fixed minus the regression above.
> 
>> 
>> -  We want to add an auditor to an application at link time, noted in the
>>   DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
>>   program is launched enables our profiler to be injected into an
>>   application’s address space without a wrapper script that sets
>>   LD_AUDIT and LD_PRELOAD.
> 
> So if I understood correctly you are asking something like a DT_FILTER
> for DT_AUDIT?

Here, I am just explaining how we would use DT_AUDIT. A working 
implementation of DT_AUDIT would meet our need here.

>> -  LD_AUDIT needs to work on aarch64, which is an important target for
>>   our tools. The fact that _dl_runtime_profile does not save register x8
>>   (the indirect result register) is often fatal for applications, which
>>   makes LD_AUDIT unusable for any purpose on aarch64.
>> 
>> -  LD_AUDIT needs to support auditing of inter-object calls on aarch64
>>   when SVE registers are in use.
>> 
>> As a final thing to consider: we understand that there is a tension
>> between security and auditability. We are concerned that changes being
>> considered for security may compromise observability for tools. For
>> tools, we would need a way to authorize full observability even in the
>> cases when that may theoretically reduce security. Perhaps setting
>> DT_AUDIT could be considered as authorizing full observability.
>> 
>> --
>> John Mellor-Crummey         Professor
>> Dept of Computer Science    Rice University
>> email: johnmc@rice.edu      phone: 713-348-5179
>> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-18 17:48   ` John Mellor-Crummey via Libc-alpha
@ 2021-06-18 18:27     ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 23+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-06-18 18:27 UTC (permalink / raw)
  To: John Mellor-Crummey; +Cc: Florian Weimer, libc-alpha



On 18/06/2021 14:48, John Mellor-Crummey wrote:
> Adhemerval,
> 
> Thanks for your detailed response to the LD_AUDIT issues I raised.
> My responses to your questions are interspersed below.
> 
> John
> 
>> On Jun 17, 2021, at 2:42 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 16/06/2021 14:55, John Mellor-Crummey via Libc-alpha wrote:
>>> I was encouraged to notify this list about several LD_AUDIT bugs that
>>> have significant impact on performance tools that we are developing
>>> for Linux in general and US Department of Energy (DOE) parallel
>>> supercomputers in particular.
>>>
>>> My team develops the HPCToolkit performance tools
>>> (https://hpctoolkit.org, https://github.com/HPCToolkit/hpctoolkit)
>>> under funding from the DOE. We have been modifying our measurement
>>> subsystem to interpose itself between an application and the OS using
>>> glibc’s LD_AUDIT capability.
>>>
>>> For such tools to succeed, we need many of LD_AUDIT’s features to
>>> work. Over the last eight months, we identified six bugs on systems we
>>> use. We believe that four are still problems in upstream glibc and two
>>> may be fixed (as noted).
>>>
>>> We would like confirmed fixes for the following bugs in upstream
>>> glibc:
>>
>>
>> Thank you very much to approach us about the current state of audit
>> interface from the user point of view.  There are some recent discussion
>> started by Ben Coyote Woodard some time ago that has stalled.
>>
>> I checked the issues you listed below with master (I can't really
>> comment their status regarding the usual HPC distributions).
>>
>>>
>>> ----------------------------------------------------------
>>> Priority   | Issue 
>>> —————————————————————————————
>>> VERY       | When using an auditor, there is an unacceptable
>>> HIGH       | performance degradation of over 10x for PLT
>>>           | calls to small procedures even when neither
>>>           | la_pltenter or la_pltexit is present.
>>
>> The fix provided by https://sourceware.org/legacy-ml/libc-alpha/2013-05/msg00888.html
>> seems to fix, I am not sure why it has not been applied.  It also seems to not
>> show any regressions.
>>
>> I will respin the patch and check if we can add a testcase.
> 
> We would be delighted to have this problem resolved!
> 
> Trying to avoid the overhead of the profiling resolver in our tool (overriding
> the library state set up by glibc to use the standard resolver, ensuring the GOT table 
> is writable, etc.) is a complex hack. While we’ve done it right in most cases, there are 
> edge cases where we still go through the profiling resolver when monitoring Firefox. 
> 
> Having glibc record the binding in the GOT as per usual for PLT calls is best.

Yes, I will send this fix and the one that fixes the regression for
2.34.

> 
>>
>>
>>> ----------------------------------------------------------
>>> HIGH       | When auditing, a dlmopen of a shared library
>>>           | causes a SEGV.
>>
>> This does not show on master, I will need to pinpoint exactly which are the
>> commits that actually fixed it.
>>
>>> ----------------------------------------------------------
>>> HIGH       | la_symbind isn't always called when appropriate.
>>>           | We observed that glibc 2.26 calls la_symbind
>>>           | when appropriate; glibc 2.28 does not.
>>
>> I am not sure how to proper fix it, maybe an option is just to disable bind-now
>> if we have any audit module enabled. We will need to discuss the possible
>> implications.
>>
>> (btw your auditor-tests/symbind does not enable bind-now with either
>> LD_BIND_NOW=1 or -Wl,-z,now).
> 
> Our test was expecting normal lazy binding. 
> 
> I would argue that there should be la_symbind callbacks even in the
> LD_BIND_NOW case. For us to use la_symbind to replace an application’s use
> of a routine with the use of a wrapped routine that informs our tool as needed
> (e.g. a call to pthread_create, fork, MPI_Init, etc.), we need the la_symbind callbacks
> even if when LD_BIND_NOW is specified for a library.

I will need to check again the order loader do the symbol resolution
for bind-now to check if it would be possible to call the callback
routines. It might be complicated to do it for bind-now though.

> 
>>
>>> ----------------------------------------------------------
>>> HIGH       | glibc does not save all necessary registers
>>>           | (e.g. X8 - the indirect result register, truncated
>>>           | SIMD registers) when auditing on aarch64 since
>>>           | the beginning of time.
>>
>> This has been already discussed on the maillist:
> 
> I am not surprised that this has been mentioned before. We originally reported 
> the problem to Ben Woodard back in September and I know from the Bugzilla records
> (and from Ben) that there has been a lot of activity exploring the implications since.
> 
> In hindsight, I think our tool may not need this problem to be resolved as long as the
> top issue (avoiding profiling PLT calls when it is not wanted) is resolved.
> Since we have no interest in profiling PLT calls, I think that having glibc avoid the 
> unwanted calls to _dl_runtime_profile by using the regular resolver instead would fix 
> our issue since the broken code is in _dl_runtime_profile.
> 
>> * NEON support: we have a proposed [1] patch that addresses it by extending the 
>>  export struct used. Although the patch seems to fix the issue described by 
>>  BZ#26643 it is still incomplete: it would required to bump LAV_CURRENT for
>>  aarch64 (and add a arch-specific way to override it), add tests, and check 
>>  on how to handle possible incompatbilities. From a briefly chat with other
>>  glibc maintainer, we can just set that audit version lower LAV_CURRENT are
>>  just unsupported.
>>
>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>  STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>>  To support SVE, it would require save/restore all registers (but pass down 
>>  arg and retval registers to pltentry/exit callbacks according to the base 
>>  PCS). Another option would be to use different LAV_CURRENT version, one for
>>  NEON and SVE with 128-bits and another for SVE larger than 256-bits.
>>  This also has performance implications.
>>
>> [1] https://patchwork.sourceware.org/project/glibc/patch/20200923160448.2321909-1-woodard@redhat.com/
>> [2] https://sourceware.org/pipermail/libc-alpha/2020-September/117822.html
>>
>>> ----------------------------------------------------------
>>> LOW        | When auditing, a dlopen of a shared library
>>>           | that uses R_X86_64_TLSDESC causes a SEGV. This
>>>           | is reportedly fixed in glibc 2.34.
>>
>> It should be fixed by BZ#27137 (8f7e09f4dbdb5c815a18b8285fbc5d5d7bc17d86),
>> however it has regressed by 03e187a41d9.  We need the following fix and we
>> definitely need a regression tests for this (I will probably use your
>> auditor-test) as base:
>>
>> diff --git a/elf/dl-open.c b/elf/dl-open.c
>> index d2240d8747..d2638ebf05 100644
>> --- a/elf/dl-open.c
>> +++ b/elf/dl-open.c
>> @@ -771,7 +771,7 @@ dl_open_worker (void *a)
>>     {
>>       struct link_map *libc_map = GL(dl_ns)[args->nsid].libc_map;
>> #ifdef SHARED
>> -      bool initial = libc_map->l_ns == LM_ID_BASE;
>> +      bool initial = libc_map != NULL ? libc_map->l_ns == LM_ID_BASE : false;
>> #else
>>       /* In the static case, there is only one namespace, but it
>>         contains a secondary libc (the primary libc is statically
> 
> A fix would be welcome. I don’t understand the glibc issues in detail for this bug
> and I’m glad to have help from others that do!
> 
>>
>>
>>> ----------------------------------------------------------
>>> LOW   | An auditor added to an executable at link time
>>>           | with --audit=auditor.so and noted in the DT_AUDIT
>>>           | entry of the dynamic section is not called at
>>>           | runtime. This is reportedly fixed in glibc 2.32.
>>> ----------------------------------------------------------
>>>
>>
>> This has also fixed by BZ#24943 on glibc 2.32
> 
> Great!
> 
>>
>>>
>>>
>>> A repository of reproducers for these bugs can be found here:
>>> https://github.com/hpctoolkit/auditor-tests.
>>
>> Thanks, this is really helpful.
>>
>>>
>>> A detailed writeup of everything known about each of these bugs,
>>> including links to Red Hat and Sourceware Bugzilla entries, if any are
>>> known to exist, can be found here:
>>> https://docs.google.com/document/d/1dVaDBdzySecxQqD6hLLzDrEF18M1UtjDna9gL5BWWI0/edit?usp=sharing
>>>
>>
>> And you are correct about your assessment on the document, we *really* need
>> more extensible tests for audit interface.  It would be really helpful if
>> we could add more tests to exercise the real world usage from tools that
>> rely on audit modules.
> 
> I know that the things in our test repository aren’t ideal for automated regression
> tests since they typically involve looking at the printed output or noticing a SEGV.
> At the outset, we were most interested in using them to document the bugs. 
> We can work with Ben Woodard to create proper auditor tests.

I discussed this briefly with Carlos O'Donnel on the weekly call and he
said it would be good if you can lay out the your ldaudit usage and what
is not covered by our tests.  With this information we can start to add
more coverage and regression tests.

> 
>>
>>> Technical stakeholders for platforms that are HPCToolkit’s principal
>>> targets under DOE funding, especially the exascale computing program:
>>>
>>> ----------------------------------------------------------
>>> Stakeholder | Why
>>> ----------------------------------------------------------
>>> Intel       | Prime contractor on Aurora exascale system at
>>>            | Argonne National Laboratory
>>> ----------------------------------------------------------
>>> IBM         | Prime contractor and processor vendor for Summit
>>>            | and Sierra supercomputers at Oak Ridge National
>>>            | Laboratory and Lawrence Livermore  National
>>>            | Laboratory.  
>>> ----------------------------------------------------------
>>> ARM         | Stakeholder who wants all ARM Linux platforms
>>>            | to succeed, including Sandia National Laboratory's
>>>            | Astra supercomputer and SUNY Stony Brook's
>>>            | A64FX-based Ookami.
>>> ----------------------------------------------------------
>>> AMD         | Processor vendor for Frontier and El Capitan
>>>            | exascale supercomputers at Oak Ridge and Lawrence
>>>            | Livermore National Laboratories.  
>>> ----------------------------------------------------------
>>> SuSE        | Linux distribution provider for Cray systems to be
>>>            | delivered to Oak Ridge and Lawrence Livermore
>>>            | National Laboratories and the A64FX-based system
>>>            | installed at SUNY Stony Brook.  
>>> ----------------------------------------------------------
>>> Red Hat     | Linux distribution provider for Oak Ridge
>>>            | National Laboratory s Summit, Lawrence Livermore
>>>            | National Laboratory s Sierra.
>>> ----------------------------------------------------------
>>> Cray        | Prime contractor and system vendor for Oak Ridge
>>>            | and Lawrence Livermore National Laboratories,
>>>            | and SUNY Stony Brook; system vendor for Argonne
>>>            | National Laboratory.
>>> ----------------------------------------------------------
>>>
>>>
>>> For reference, here is a pointer to the portion of our tool that uses
>>> the LD_AUDIT interface:
>>> https://github.com/HPCToolkit/hpctoolkit/blob/master/src/tool/hpcrun/audit/auditor.c
>>>
>>> Here are some of the capabilities of LD_AUDIT that we need to work and why:
>>>
>>> -  We use LD_AUDIT’s la_objopen and la_objclose to track what objects are
>>>   in an application’s address space so that our measurement subsystem
>>>   can unwind the call stack when a profiling signal is
>>>   received. Tracking libraries by wrapping dlopen is problematic for
>>>   several reasons. For instance, a wrapper would need to implement RPATH
>>>   and RUNPATH semantics because glibc does not provide an alternate
>>>   dlopen interface (like _dlsym) so that a wrapper can provide the
>>>   return address in the requesting library as an argument which glibc
>>>   needs to determine the R_PATH and RUNPATH to use when trying to find
>>>   the library and its dependencies.
>>
>> It is not clear to me  what kind of extensible interface for audit you are 
>> asking here or if la_objopen/la_objclose is already suffice.
> 
> The existing la_objopen/la_objclose suffice, provided there is not a blocker that 
> prevents us from using LD_AUDIT. 
> 
> However, it is worth noting that there is really no other reasonable option 
> (e.g. a simple wrapper for dlopen) since there is not an alternate interface 
> for dlopen where one can provide the return address (which dlopen uses to 
> determine the applicable RPATH or RUNPATH) as one can with _dlsym.

Indeed, the problem is extending the interface would probable require bumping
the audit interface version and it incurs in its own set of issue.  So if
we will ever do it, I would like to have more input to design a better and
comprehensible extension (such as the one your just described).

> 
>>> -  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
>>>   wrappers around key functions, e.g. pthread_create. This enables a
>>>   tool to intercept functions invoked through pointers obtained with
>>>   dlsym, which preloaded wrappers can’t do. (Note: We don’t use
>>>   la_symbind for interposition yet, but we plan to when it works
>>>   everywhere.)
>>>
>>> -  We need auditing to work when an application or a tool library (e.g.,
>>>   Intel’s GT-Pin) opens a shared library with dlmopen.
>>>
>>> -  We need auditing to work when opening a dynamic library with TLS
>>>   dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
>>>   any special interest in such relocations; at present, they cause a
>>>   SEGV when auditing and that must be avoided.
>>
>> This should be fixed minus the regression above.
>>
>>>
>>> -  We want to add an auditor to an application at link time, noted in the
>>>   DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
>>>   program is launched enables our profiler to be injected into an
>>>   application’s address space without a wrapper script that sets
>>>   LD_AUDIT and LD_PRELOAD.
>>
>> So if I understood correctly you are asking something like a DT_FILTER
>> for DT_AUDIT?
> 
> Here, I am just explaining how we would use DT_AUDIT. A working 
> implementation of DT_AUDIT would meet our need here.

Right, I asked about the DT_FILTER because it works as a 'weak' library
DT_NEEDED: if the filter library is not found it is not loaded and does
not trigger a loader failure. I had the impression that it was something
alike you are asking.

> 
>>> -  LD_AUDIT needs to work on aarch64, which is an important target for
>>>   our tools. The fact that _dl_runtime_profile does not save register x8
>>>   (the indirect result register) is often fatal for applications, which
>>>   makes LD_AUDIT unusable for any purpose on aarch64.
>>>
>>> -  LD_AUDIT needs to support auditing of inter-object calls on aarch64
>>>   when SVE registers are in use.
>>>
>>> As a final thing to consider: we understand that there is a tension
>>> between security and auditability. We are concerned that changes being
>>> considered for security may compromise observability for tools. For
>>> tools, we would need a way to authorize full observability even in the
>>> cases when that may theoretically reduce security. Perhaps setting
>>> DT_AUDIT could be considered as authorizing full observability.

Thanks again for reaching the glibc community, I wish more users would
came to us with their requirements and issues.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-17 20:09   ` Florian Weimer via Libc-alpha
  2021-06-17 23:06     ` Adhemerval Zanella via Libc-alpha
@ 2021-06-21 19:42     ` John Mellor-Crummey via Libc-alpha
  2021-06-22  8:15       ` Florian Weimer via Libc-alpha
  1 sibling, 1 reply; 23+ messages in thread
From: John Mellor-Crummey via Libc-alpha @ 2021-06-21 19:42 UTC (permalink / raw)
  To: Florian Weimer; +Cc: John Mellor-Crummey, libc-alpha


> On Jun 17, 2021, at 3:09 PM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * Adhemerval Zanella:
> 
>> I checked the issues you listed below with master (I can't really
>> comment their status regarding the usual HPC distributions).
> 
> Thank you for reviewing the status quo.
> 
>>> ----------------------------------------------------------
>>> HIGH       | la_symbind isn't always called when appropriate.
>>>           | We observed that glibc 2.26 calls la_symbind
>>>           | when appropriate; glibc 2.28 does not.
>> 
>> I am not sure how to proper fix it, maybe an option is just to disable bind-now
>> if we have any audit module enabled. We will need to discuss the possible
>> implications.
>> 
>> (btw your auditor-tests/symbind does not enable bind-now with either
>> LD_BIND_NOW=1 or -Wl,-z,now).
> 
> The issue is that the la_symbind interface is not very good at
> communicating that PLT enter/exit hooks aren't available under these
> circumstances.  

This is a separate issue from the one we reported. The issue we reported
was that la_symbind wasn’t called and LD_BIND_NOW was not used.


> Maybe we can provide another flag and signal an error if
> la_symbind requests enter/exit hooks in BIND_NOW mode.

This may be reasonable, though my opinion would probably be different 
if I wanted PLT auditing. If some tool really needed PLT auditing, then 
being told NO might seem less reasonable.

If an auditor was not going to get the callbacks it wants with BIND_NOW, 
perhaps the auditor should be able to override the BIND_NOW.

> 
>>> ----------------------------------------------------------
>>> HIGH       | glibc does not save all necessary registers
>>>           | (e.g. X8 - the indirect result register, truncated
>>>           | SIMD registers) when auditing on aarch64 since
>>>           | the beginning of time.
>> 
>> This has been already discussed on the maillist:
>> 
>> * NEON support: we have a proposed [1] patch that addresses it by extending the 
>>  export struct used. Although the patch seems to fix the issue described by 
>>  BZ#26643 it is still incomplete: it would required to bump LAV_CURRENT for
>>  aarch64 (and add a arch-specific way to override it), add tests, and check 
>>  on how to handle possible incompatbilities. From a briefly chat with other
>>  glibc maintainer, we can just set that audit version lower LAV_CURRENT are
>>  just unsupported.
>> 
>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>  STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>>  To support SVE, it would require save/restore all registers (but pass down 
>>  arg and retval registers to pltentry/exit callbacks according to the base 
>>  PCS). Another option would be to use different LAV_CURRENT version, one for
>>  NEON and SVE with 128-bits and another for SVE larger than 256-bits.
>>  This also has performance implications.
> 
> SVE support overlaps with the la_symbind issue quoted above because
> those bindings are conceptually BIND_NOW.
> 
>> [1] https://patchwork.sourceware.org/project/glibc/patch/20200923160448.2321909-1-woodard@redhat.com/
>> [2] https://sourceware.org/pipermail/libc-alpha/2020-September/117822.html
>> 
>>> ----------------------------------------------------------
>>> LOW        | When auditing, a dlopen of a shared library
>>>           | that uses R_X86_64_TLSDESC causes a SEGV. This
>>>           | is reportedly fixed in glibc 2.34.
>> 
>> It should be fixed by BZ#27137 (8f7e09f4dbdb5c815a18b8285fbc5d5d7bc17d86),
>> however it has regressed by 03e187a41d9.  We need the following fix and we
>> definitely need a regression tests for this (I will probably use your
>> auditor-test) as base:
>> 
>> diff --git a/elf/dl-open.c b/elf/dl-open.c
>> index d2240d8747..d2638ebf05 100644
>> --- a/elf/dl-open.c
>> +++ b/elf/dl-open.c
>> @@ -771,7 +771,7 @@ dl_open_worker (void *a)
>>     {
>>       struct link_map *libc_map = GL(dl_ns)[args->nsid].libc_map;
>> #ifdef SHARED
>> -      bool initial = libc_map->l_ns == LM_ID_BASE;
>> +      bool initial = libc_map != NULL ? libc_map->l_ns == LM_ID_BASE : false;
>> #else
>>       /* In the static case, there is only one namespace, but it
>>         contains a secondary libc (the primary libc is statically
> 
> Hmm, I'm surprised that this doesn't crash more widely.  Is it because
> DT_NEEDED on ld.so preceeds libc.so if __tls_get_addr is used, or
> something like that?
> 
>> And you are correct about your assessment on the document, we *really* need
>> more extensible tests for audit interface.  It would be really helpful if
>> we could add more tests to exercise the real world usage from tools that
>> rely on audit modules.
> 
> Agreed.  It would also be helpful to verify that the callbacks happened
> in the expected order.  This is always a troublesome aspect of a
> callback-based interface.

We used a state machine in our auditor to keep track of what stage we were in.

enum hpcrun_state {
   state_awaiting, state_found, state_attached,
  state_connecting, state_connected, state_disconnected,
};

In the LD_AUDIT interface, some callbacks inspect the current state of our auditor,
take appropriate action based on the state, and changing the state, if appropriate.

One could use such an approach with a state transition table that specifies the states
where a callback is legal and then perhaps a new state value that should be set before
the callback completes. Such a transition table should enable you to verify that the callbacks
occured in the expected partial order.

> 
>>> -  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
>>>   wrappers around key functions, e.g. pthread_create. This enables a
>>>   tool to intercept functions invoked through pointers obtained with
>>>   dlsym, which preloaded wrappers can’t do. (Note: We don’t use
>>>   la_symbind for interposition yet, but we plan to when it works
>>>   everywhere.)
>>> 
>>> -  We need auditing to work when an application or a tool library (e.g.,
>>>   Intel’s GT-Pin) opens a shared library with dlmopen.
>>> 
>>> -  We need auditing to work when opening a dynamic library with TLS
>>>   dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
>>>   any special interest in such relocations; at present, they cause a
>>>   SEGV when auditing and that must be avoided.
>> 
>> This should be fixed minus the regression above.
> 
> pthread_create interception becomes more difficult in glibc 2.34 because
> the pthread_create symbol is no longer interposable.

I don’t understand why pthread_create will no longer be interposable in 2.34.
We have a set of other functions that we also need to intercept, shown below:

_Exit
_exit
execl
execle
execlp
execv
execve
execvp
exit
fork
pthread_create
pthread_exit
pthread_sigmask
sigaction
signal
sigprocmask
sigtimedwait
sigwait
sigwaitinfo
system
vfork

If by "pthread_create symbol is no longer interposable", that means we can’t 
insert a wrapper, then that is very bad for performance tools. Our tool and other
performance tools need to intercept thread creation, take appropriate action
in both the calling thread, and arrange for the new thread to call a wrapper as 
well so that the new thread can properly set up performance monitoring before
beginning application code begins to execute in the new thread.

> We could make pthread_create interposable in the same we way do that
> today for malloc.

Should we expect any problems for the other functions listed above in addition to pthread_create?

> 
> pthread_create interposition on glibc 2.0 architectures needs version
> information in la_symbind.  No 64-bit architecture except Alpha is that
> old, so maybe this doesn't matter today.  (libstdc++ currently binds to
> the pthread_create compatibility symbol on these architectures, so the
> issue is quite visible, but a rebuild of GCC against glibc 2.34 will fix
> that, too.  Then the default version will be used.)


> 
> One caveat: We should be able to make la_symbind work on current
> distributions with their hardened build defaults, *however* PLT
> enter/exit hooks will not work.  

	

> For the time being, you would have to
> find a way to wrap the call yourself.  This is theoretically fixable
> even without run-time code generation (Madhavan T. Venkataraman from
> Microsoft has submitted patches in this area for libffi), but it's
> entirely vaporware at this point.  This looks like a fairly isolated
> project someone could work on without spending considerable time on
> learning the internals of the dynamic loader.
> 
>>> -  We want to add an auditor to an application at link time, noted in the
>>>   DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
>>>   program is launched enables our profiler to be injected into an
>>>   application’s address space without a wrapper script that sets
>>>   LD_AUDIT and LD_PRELOAD.
>> 
>> So if I understood correctly you are asking something like a DT_FILTER
>> for DT_AUDIT?
> 
> Indeed, please let us know if the existing DT_AUDIT support does not
> work for you.

I’ve never used DT_FILTER before. I just skimmed the materials about DT_FILTER. 
I don’t believe that we need a DT_FILTER capability for LD_AUDIT.
> 
> Thanks,
> Florian
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-21 19:42     ` John Mellor-Crummey via Libc-alpha
@ 2021-06-22  8:15       ` Florian Weimer via Libc-alpha
  2021-06-22 15:04         ` John Mellor-Crummey via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-06-22  8:15 UTC (permalink / raw)
  To: John Mellor-Crummey; +Cc: libc-alpha

* John Mellor-Crummey:

>  On Jun 17, 2021, at 3:09 PM, Florian Weimer <fweimer@redhat.com> wrote:
>
>> The issue is that the la_symbind interface is not very good at
>> communicating that PLT enter/exit hooks aren't available under these
>> circumstances.  
>
> This is a separate issue from the one we reported. The issue we reported
> was that la_symbind wasn’t called and LD_BIND_NOW was not used.

It's kind of related.  Our own example implementation looks like this:

uintptr_t
la_symbind (Elf_Sym *sym, unsigned int ndx, uintptr_t *refcook,
	    uintptr_t *defcook, unsigned int *flags, const char *symname)
{
  if (!do_exit)
    *flags = LA_SYMB_NOPLTEXIT;

  return sym->st_value;
}

Let's assume that we start calling la_symbind in places where there is
no support for enter/exit hooks.  We could initialize *flags with
LA_SYMB_NOPLTENTER | LA_SYMB_NOPLTEXIT, but the code above would clear
the LA_SYMB_NOPLTENTER flag in !do_exit mode.

I want to increase LAV_CURRENT to 2 and call la_symbind in the BIND_NOW
cases only if la_version returned a value greater than 1.  This way, old
audit modules (which are supposed to return LAV_CURRENT from <link.h> in
la_version) will continue to work because they do not see any unexpected
la_symbind calls.

Once we call la_symbind in contexts where no enter/exit hooks are
available, we should initialize the flags to LA_SYMB_NOPLTENTER |
LA_SYMB_NOPLTEXIT (so that la_symbind can detect the situation), and
report a dlopen/loader error if those flags are cleared by la_symbind.
(With our example code, this would call pretty much all binding to fail,
which is why I think we need the LAV_CURRENT change.)

>> pthread_create interception becomes more difficult in glibc 2.34 because
>> the pthread_create symbol is no longer interposable.
>
> I don’t understand why pthread_create will no longer be interposable in 2.34.
> We have a set of other functions that we also need to intercept, shown below:
>
> _Exit
> _exit
> execl
> execle
> execlp
> execv
> execve
> execvp
> exit
> fork
> pthread_create
> pthread_exit
> pthread_sigmask
> sigaction
> signal
> sigprocmask
> sigtimedwait
> sigwait
> sigwaitinfo
> system
> vfork
>
> If by "pthread_create symbol is no longer interposable", that means we
> can’t insert a wrapper, then that is very bad for performance tools.

Once we merge librt and libanl into libc (patches for that have been
posted), mq_notify, the timer functions, and getaddrinfo_a will call
pthread_create using a direct call that cannot be intercepted in this
way.  There is precedent for making things interposable/interceptable
in the form of malloc

  <https://www.gnu.org/software/libc/manual/html_node/Replacing-malloc.html>

but we are currently do not plan to do this for pthread_create.  It
would not be an ABI change as such, so we could introduce the indirect
call as a later change based on user feedback.

You can already see this non-interceptable thread creation behavior
today (in glibc 2.33 and earlier) with thrd_create, which does not
result in a pthread_create call, either, despite creating a new thread
as if by pthread_create.

It's also the reason why your list contains the exec* functions and
system in addition to fork, vfork, and execve, even though system is
implemented on top of those functions: the internal direct calls are
invisible to auditors.  But posix_spawn, posix_spawnp, popen are
missing, too, so you will not trace all created processes.

Starting with glibc 2.32, thread signal masks can also be manipulated
using pthread_attr_setsigmask_np, and that might go unnoticed with your
present sets of intercepts (although the mask change would be visible
from a thread start routine wrapper injected via pthread_create).

Going back to trheading, I find it a bit curious that you intercept
pthread_create, but not pthread_join.  How do you detect thread exit?  I
assume you are interested in that event, too.  Merely wrapping the
thread start routine is insufficient because there are other ways for a
thread to exit besides returning from the start routine and calling
pthread_exit (e.g., thread cancellation and unwinding).

> Should we expect any problems for the other functions listed above in
> addition to pthread_create?

I don't think glibc 2.34 will bring any new problems in this area, but
there are some pre-existing issues around posix_spawn, popen,
pthread_attr_setsigmask_np.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-22  8:15       ` Florian Weimer via Libc-alpha
@ 2021-06-22 15:04         ` John Mellor-Crummey via Libc-alpha
  2021-06-22 15:36           ` Florian Weimer via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: John Mellor-Crummey via Libc-alpha @ 2021-06-22 15:04 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, Xiaozhu Meng, John Mellor-Crummey, Mark W. Krentel



> On Jun 22, 2021, at 3:15 AM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * John Mellor-Crummey:
> 
>> On Jun 17, 2021, at 3:09 PM, Florian Weimer <fweimer@redhat.com> wrote:
>> 
>>> The issue is that the la_symbind interface is not very good at
>>> communicating that PLT enter/exit hooks aren't available under these
>>> circumstances.  
>> 
>> This is a separate issue from the one we reported. The issue we reported
>> was that la_symbind wasn’t called and LD_BIND_NOW was not used.
> 
> It's kind of related.  Our own example implementation looks like this:
> 
> uintptr_t
> la_symbind (Elf_Sym *sym, unsigned int ndx, uintptr_t *refcook,
> 	    uintptr_t *defcook, unsigned int *flags, const char *symname)
> {
>  if (!do_exit)
>    *flags = LA_SYMB_NOPLTEXIT;
> 
>  return sym->st_value;
> }
> 
> Let's assume that we start calling la_symbind in places where there is
> no support for enter/exit hooks.  We could initialize *flags with
> LA_SYMB_NOPLTENTER | LA_SYMB_NOPLTEXIT, but the code above would clear
> the LA_SYMB_NOPLTENTER flag in !do_exit mode.
> 
> I want to increase LAV_CURRENT to 2 and call la_symbind in the BIND_NOW
> cases only if la_version returned a value greater than 1.  This way, old
> audit modules (which are supposed to return LAV_CURRENT from <link.h> in
> la_version) will continue to work because they do not see any unexpected
> la_symbind calls.
> 
> Once we call la_symbind in contexts where no enter/exit hooks are
> available, we should initialize the flags to LA_SYMB_NOPLTENTER |
> LA_SYMB_NOPLTEXIT (so that la_symbind can detect the situation), and
> report a dlopen/loader error if those flags are cleared by la_symbind.
> (With our example code, this would call pretty much all binding to fail,
> which is why I think we need the LAV_CURRENT change.)

Your suggested change would work for us. I would be happy with receiving the 
la_symbind calls even in the absence of an ability to get PLT enter/exit calls as
we don’t want them anyway for our tool.

To be explicit, since we intend to use la_symbind to provide a wrapped functions rather
than the original in some cases, we want la_symbind callbacks even if BIND_NOW
eager bindings are performed.

> 
>>> pthread_create interception becomes more difficult in glibc 2.34 because
>>> the pthread_create symbol is no longer interposable.
>> 
>> I don’t understand why pthread_create will no longer be interposable in 2.34.
>> We have a set of other functions that we also need to intercept, shown below:
>> 
>> _Exit
>> _exit
>> execl
>> execle
>> execlp
>> execv
>> execve
>> execvp
>> exit
>> fork
>> pthread_create
>> pthread_exit
>> pthread_sigmask
>> sigaction
>> signal
>> sigprocmask
>> sigtimedwait
>> sigwait
>> sigwaitinfo
>> system
>> vfork
>> 
>> If by "pthread_create symbol is no longer interposable", that means we
>> can’t insert a wrapper, then that is very bad for performance tools.
> 
> Once we merge librt and libanl into libc (patches for that have been
> posted), mq_notify, the timer functions, and getaddrinfo_a will call
> pthread_create using a direct call that cannot be intercepted in this
> way.  There is precedent for making things interposable/interceptable
> in the form of malloc
> 
>  <https://www.gnu.org/software/libc/manual/html_node/Replacing-malloc.html>
> 
> but we are currently do not plan to do this for pthread_create.  It
> would not be an ABI change as such, so we could introduce the indirect
> call as a later change based on user feedback.
> 
> You can already see this non-interceptable thread creation behavior
> today (in glibc 2.33 and earlier) with thrd_create, which does not
> result in a pthread_create call, either, despite creating a new thread
> as if by pthread_create.

Having a non-interposable thrd_create is a problem for us too, though 
we haven’t yet seen it in practice in HPC applications (or maybe it happened
and we were just unaware!).

> It's also the reason why your list contains the exec* functions and
> system in addition to fork, vfork, and execve, even though system is
> implemented on top of those functions: the internal direct calls are
> invisible to auditors.  But posix_spawn, posix_spawnp, popen are
> missing, too, so you will not trace all created processes.

Thanks for the advice. We will need to look at posix_spawn, 
posix_spawn_p, and popen.

> Starting with glibc 2.32, thread signal masks can also be manipulated
> using pthread_attr_setsigmask_np, and that might go unnoticed with your
> present sets of intercepts (although the mask change would be visible
> from a thread start routine wrapper injected via pthread_create).

We inspect and manipulate other thread attributes prior to thread creation.
We’ll add signal masks to the list. Thanks for the warning. 

> 
> Going back to trheading, I find it a bit curious that you intercept
> pthread_create, but not pthread_join.  How do you detect thread exit?  I
> assume you are interested in that event, too.  Merely wrapping the
> thread start routine is insufficient because there are other ways for a
> thread to exit besides returning from the start routine and calling
> pthread_exit (e.g., thread cancellation and unwinding).

We use pthread_cleanup_push to add a routine that will be called when a thread
exits.

> 
>> Should we expect any problems for the other functions listed above in
>> addition to pthread_create?
> 
> I don't think glibc 2.34 will bring any new problems in this area, but
> there are some pre-existing issues around posix_spawn, popen,
> pthread_attr_setsigmask_np.
> 
> Thanks,
> Florian
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-22 15:04         ` John Mellor-Crummey via Libc-alpha
@ 2021-06-22 15:36           ` Florian Weimer via Libc-alpha
  2021-06-22 16:17             ` John Mellor-Crummey via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-06-22 15:36 UTC (permalink / raw)
  To: John Mellor-Crummey; +Cc: libc-alpha, Xiaozhu Meng, Mark W. Krentel

* John Mellor-Crummey:

>> On Jun 22, 2021, at 3:15 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> You can already see this non-interceptable thread creation behavior
>> today (in glibc 2.33 and earlier) with thrd_create, which does not
>> result in a pthread_create call, either, despite creating a new thread
>> as if by pthread_create.
>
> Having a non-interposable thrd_create is a problem for us too, though 
> we haven’t yet seen it in practice in HPC applications (or maybe it happened
> and we were just unaware!).

I meant that thrd_create needs to be intercepted separately.  This will
still work.

>> Going back to trheading, I find it a bit curious that you intercept
>> pthread_create, but not pthread_join.  How do you detect thread exit?  I
>> assume you are interested in that event, too.  Merely wrapping the
>> thread start routine is insufficient because there are other ways for a
>> thread to exit besides returning from the start routine and calling
>> pthread_exit (e.g., thread cancellation and unwinding).
>
> We use pthread_cleanup_push to add a routine that will be called when a thread
> exits.

Okay, that should work, although some application-supplied TLS
destructors will run later than that.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-22 15:36           ` Florian Weimer via Libc-alpha
@ 2021-06-22 16:17             ` John Mellor-Crummey via Libc-alpha
  2021-06-22 16:33               ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: John Mellor-Crummey via Libc-alpha @ 2021-06-22 16:17 UTC (permalink / raw)
  To: Florian Weimer
  Cc: libc-alpha, Xiaozhu Meng, John Mellor-Crummey, Mark W. Krentel

Florian,


> On Jun 22, 2021, at 10:36 AM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * John Mellor-Crummey:
> 
>>> On Jun 22, 2021, at 3:15 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>> You can already see this non-interceptable thread creation behavior
>>> today (in glibc 2.33 and earlier) with thrd_create, which does not
>>> result in a pthread_create call, either, despite creating a new thread
>>> as if by pthread_create.
>> 
>> Having a non-interposable thrd_create is a problem for us too, though 
>> we haven’t yet seen it in practice in HPC applications (or maybe it happened
>> and we were just unaware!).
> 
> I meant that thrd_create needs to be intercepted separately.  This will
> still work.

Got it.

Would you recommend intercepting clone instead of trying
to intercept pthread_create and thrd_create?

That should avoid trouble interposing pthread_create.

> 
>>> Going back to trheading, I find it a bit curious that you intercept
>>> pthread_create, but not pthread_join.  How do you detect thread exit?  I
>>> assume you are interested in that event, too.  Merely wrapping the
>>> thread start routine is insufficient because there are other ways for a
>>> thread to exit besides returning from the start routine and calling
>>> pthread_exit (e.g., thread cancellation and unwinding).
>> 
>> We use pthread_cleanup_push to add a routine that will be called when a thread
>> exits.
> 
> Okay, that should work, although some application-supplied TLS
> destructors will run later than that.

For our tools, we need to shut down profiling when a thread is finishing. That
can and should be done before the thread really gets torn down. We have
learned the hard way that we need to turn off profiling before TLS destruction starts. 

--
John Mellor-Crummey		Professor
Dept of Computer Science	Rice University
email: johnmc@rice.edu		phone: 713-348-5179


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-22 16:17             ` John Mellor-Crummey via Libc-alpha
@ 2021-06-22 16:33               ` Adhemerval Zanella via Libc-alpha
  2021-06-23  6:32                 ` Florian Weimer via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-06-22 16:33 UTC (permalink / raw)
  To: John Mellor-Crummey, Florian Weimer
  Cc: Xiaozhu Meng, libc-alpha, Mark W. Krentel



On 22/06/2021 13:17, John Mellor-Crummey wrote:
> Florian,
> 
> 
>> On Jun 22, 2021, at 10:36 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * John Mellor-Crummey:
>>
>>>> On Jun 22, 2021, at 3:15 AM, Florian Weimer <fweimer@redhat.com> wrote:
>>>> You can already see this non-interceptable thread creation behavior
>>>> today (in glibc 2.33 and earlier) with thrd_create, which does not
>>>> result in a pthread_create call, either, despite creating a new thread
>>>> as if by pthread_create.
>>>
>>> Having a non-interposable thrd_create is a problem for us too, though 
>>> we haven’t yet seen it in practice in HPC applications (or maybe it happened
>>> and we were just unaware!).
>>
>> I meant that thrd_create needs to be intercepted separately.  This will
>> still work.
> 
> Got it.
> 
> Would you recommend intercepting clone instead of trying
> to intercept pthread_create and thrd_create?
> 
> That should avoid trouble interposing pthread_create.

Interposing 'clone' falls in the same category, the syscall is issued directly
within de clode. 

Currently, you need to interpose both pthread_create and thrd_create.  Florian
has suggested we allow pthread_create to be interposable (meaning glibc will
issue a plt call on each usage). 

We can do it for clone instead, it would have the advantage to hide the multiple
architecture different kernel ABIs.


> 
>>
>>>> Going back to trheading, I find it a bit curious that you intercept
>>>> pthread_create, but not pthread_join.  How do you detect thread exit?  I
>>>> assume you are interested in that event, too.  Merely wrapping the
>>>> thread start routine is insufficient because there are other ways for a
>>>> thread to exit besides returning from the start routine and calling
>>>> pthread_exit (e.g., thread cancellation and unwinding).
>>>
>>> We use pthread_cleanup_push to add a routine that will be called when a thread
>>> exits.
>>
>> Okay, that should work, although some application-supplied TLS
>> destructors will run later than that.
> 
> For our tools, we need to shut down profiling when a thread is finishing. That
> can and should be done before the thread really gets torn down. We have
> learned the hard way that we need to turn off profiling before TLS destruction starts. 
> 
> --
> John Mellor-Crummey		Professor
> Dept of Computer Science	Rice University
> email: johnmc@rice.edu		phone: 713-348-5179
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-22 16:33               ` Adhemerval Zanella via Libc-alpha
@ 2021-06-23  6:32                 ` Florian Weimer via Libc-alpha
  2021-06-23 20:06                   ` Mark Krentel via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-06-23  6:32 UTC (permalink / raw)
  To: Adhemerval Zanella
  Cc: libc-alpha, Xiaozhu Meng, John Mellor-Crummey, Mark W. Krentel

* Adhemerval Zanella:

> Currently, you need to interpose both pthread_create and thrd_create.  Florian
> has suggested we allow pthread_create to be interposable (meaning glibc will
> issue a plt call on each usage). 
>
> We can do it for clone instead, it would have the advantage to hide
> the multiple architecture different kernel ABIs.

But the clone call is very low-level.  The start routine is not called
with a fully configured thread, so wrapping the startup routine would be
rather awkward.

The guts of posix_spawn are equally tricky because code is running
without a properly configured TCB and stack.  I'm not sure if it
possible to call an interceptable execve from posix_spawn, for instance.
We would need to know more about interceptor requirements to see if
there is a solution.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-17 23:06     ` Adhemerval Zanella via Libc-alpha
@ 2021-06-23 17:42       ` Ben Woodard via Libc-alpha
  2021-07-30 14:58         ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Woodard via Libc-alpha @ 2021-06-23 17:42 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha



> On Jun 17, 2021, at 4:06 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
> 
> 
> On 17/06/2021 17:09, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> 
>>> 
>>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>>  STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 

To me this sounds like partly a toolchain issue. The aarch64 PCS does define the ABI for SVE calls. I haven’t checked what GCC/binutils does in quite a while. It seems like the  STO_AARCH64_VARIANT_PCS was an expedient for SVE when it first came out and the semantics that it defines where all the registers are caller preserved makes it very difficult to implement around.

For LAV_CURRENT=2 what I planned to do was:

diff --git a/sysdeps/aarch64/bits/link.h b/sysdeps/aarch64/bits/link.h
index ca76087ee1..390b12a826 100644
--- a/sysdeps/aarch64/bits/link.h
+++ b/sysdeps/aarch64/bits/link.h
@@ -20,13 +20,24 @@
 # error "Never include <bits/link.h> directly; use <link.h> instead."
 #endif
 
+typedef struct La_sve_regs {
+  uint16_t    *lr_preg[3];
+  __uint128_t *lr_zreg[8];
+} La_sve_regs;
+
 /* Registers for entry into PLT on AArch64.  */
 typedef struct La_aarch64_regs
 {
   uint64_t    lr_xreg[9];
-  __uint128_t lr_vreg[8];
   uint64_t    lr_sp;
   uint64_t    lr_lr;
+  char    lr_sve; /* 0 - no SVE,
+                        1-16 length of the SVE registers in vq (128 bits) */
+  union {
+    /* when sve!=0 accessing the lr_vreg is undefined */
+    __uint128_t lr_vreg[8];
+    La_sve_regs lr_zreg;
+  };
 } La_aarch64_regs;
 
 /* Return values for calls from PLT on AArch64.  */
@@ -34,9 +45,14 @@ typedef struct La_aarch64_retval
 {
   /* Up to eight integer registers can be used for a return value.  */
   uint64_t    lrv_xreg[8];
-  /* Up to eight V registers can be used for a return value.  */
-  __uint128_t lrv_vreg[8];
-
+  char        lrv_sve; /* 0 - no SVE,
+                         1-16 length of the SVE registers in vq (128 bits) */
+  union{
+    /* Up to eight V registers can be used for a return value.
+       When sve!=0 accessing the lr_vreg is undefined */
+    __uint128_t lrv_vreg[8];
+    La_sve_regs lrv_zreg;
+  };
 } La_aarch64_retval;
 __BEGIN_DECLS
 
However, that would require toolchain support and another hint in st_other which separates SVE calls from other uses of STO_AARCH64_VARIANT_PCS like STO_AARCH64_VARIANT_SVE. Then the runtime linker could populate the lrv_sve with information from the lrv_vreg with the size of the vector registers from the processor’s registers. 

There are at least two problems with that approach. 
1) who allocates the lr_zreg pointers in the la_sve_regs and how long should they be? Do they always have to be allocated to be the max size 2048 bits?
2) I hadn’t worked out how to handle functions that return things in the SVE regs. Do we need two new flags in st_other? STO_AARCH64_VARIANT_SVE and STO_AARCH64_VARIANT_SVERET?

Then there was the question in the future could there be: big.LITTLE processors where some big processors had SVE registers of one length while the LITTLE processors had different ones? 

I kind of got into the: I don’t know what to do zone.


>>>  To support SVE, it would require save/restore all registers (but pass down 
>>>  arg and retval registers to pltentry/exit callbacks according to the base 
>>>  PCS). Another option would be to use different LAV_CURRENT version, one for
>>>  NEON and SVE with 128-bits and another for SVE larger than 256-bits.
>>>  This also has performance implications.
>> 
>> SVE support overlaps with the la_symbind issue quoted above because
>> those bindings are conceptually BIND_NOW.
> 
> There is additional issues where we need to define whether we will use
> a hacky solution to work around the ABI limitation or if we can design
> something between.  And it also has the performance implications, it
> will be hard not to notice potentially multiple KBs of load/store on
> each PLT call.
> 
>> 
>>> [1] https://patchwork.sourceware.org/project/glibc/patch/20200923160448.2321909-1-woodard@redhat.com/
>>> [2] https://sourceware.org/pipermail/libc-alpha/2020-September/117822.html
>>> 
>>>> ----------------------------------------------------------
>>>> LOW        | When auditing, a dlopen of a shared library
>>>>           | that uses R_X86_64_TLSDESC causes a SEGV. This
>>>>           | is reportedly fixed in glibc 2.34.
>>> 
>>> It should be fixed by BZ#27137 (8f7e09f4dbdb5c815a18b8285fbc5d5d7bc17d86),
>>> however it has regressed by 03e187a41d9.  We need the following fix and we
>>> definitely need a regression tests for this (I will probably use your
>>> auditor-test) as base:
>>> 
>>> diff --git a/elf/dl-open.c b/elf/dl-open.c
>>> index d2240d8747..d2638ebf05 100644
>>> --- a/elf/dl-open.c
>>> +++ b/elf/dl-open.c
>>> @@ -771,7 +771,7 @@ dl_open_worker (void *a)
>>>     {
>>>       struct link_map *libc_map = GL(dl_ns)[args->nsid].libc_map;
>>> #ifdef SHARED
>>> -      bool initial = libc_map->l_ns == LM_ID_BASE;
>>> +      bool initial = libc_map != NULL ? libc_map->l_ns == LM_ID_BASE : false;
>>> #else
>>>       /* In the static case, there is only one namespace, but it
>>>         contains a secondary libc (the primary libc is statically
>> 
>> Hmm, I'm surprised that this doesn't crash more widely.  Is it because
>> DT_NEEDED on ld.so preceeds libc.so if __tls_get_addr is used, or
>> something like that?
> 
> I will need to check why we haven't see it with our own testcase.
> I in case I will prepare a patch to fix it.
> 
>> 
>>> And you are correct about your assessment on the document, we *really* need
>>> more extensible tests for audit interface.  It would be really helpful if
>>> we could add more tests to exercise the real world usage from tools that
>>> rely on audit modules.
>> 
>> Agreed.  It would also be helpful to verify that the callbacks happened
>> in the expected order.  This is always a troublesome aspect of a
>> callback-based interface.
>> 
>>>> -  We want to use LD_AUDIT’s la_symbind32 and la_symbind64 to interpose
>>>>   wrappers around key functions, e.g. pthread_create. This enables a
>>>>   tool to intercept functions invoked through pointers obtained with
>>>>   dlsym, which preloaded wrappers can’t do. (Note: We don’t use
>>>>   la_symbind for interposition yet, but we plan to when it works
>>>>   everywhere.)
>>>> 
>>>> -  We need auditing to work when an application or a tool library (e.g.,
>>>>   Intel’s GT-Pin) opens a shared library with dlmopen.
>>>> 
>>>> -  We need auditing to work when opening a dynamic library with TLS
>>>>   dialect gnu2 relocations on x86_64 (R_X86_64_TLSDESC). We don’t have
>>>>   any special interest in such relocations; at present, they cause a
>>>>   SEGV when auditing and that must be avoided.
>>> 
>>> This should be fixed minus the regression above.
>> 
>> pthread_create interception becomes more difficult in glibc 2.34 because
>> the pthread_create symbol is no longer interposable.
>> 
>> We could make pthread_create interposable in the same we way do that
>> today for malloc.
> 
> But it is only a handfull call internall (POSIX timer for instance).
> I don't think
> 
>> 
>> pthread_create interposition on glibc 2.0 architectures needs version
>> information in la_symbind.  No 64-bit architecture except Alpha is that
>> old, so maybe this doesn't matter today.  (libstdc++ currently binds to
>> the pthread_create compatibility symbol on these architectures, so the
>> issue is quite visible, but a rebuild of GCC against glibc 2.34 will fix
>> that, too.  Then the default version will be used.)
>> 
>> One caveat: We should be able to make la_symbind work on current
>> distributions with their hardened build defaults, *however* PLT
>> enter/exit hooks will not work.  For the time being, you would have to
>> find a way to wrap the call yourself.  This is theoretically fixable
>> even without run-time code generation (Madhavan T. Venkataraman from
>> Microsoft has submitted patches in this area for libffi), but it's
>> entirely vaporware at this point.  This looks like a fairly isolated
>> project someone could work on without spending considerable time on
>> learning the internals of the dynamic loader.
> 
> Is this what your are referring [1]? It is not clear to me how the v2 of this
> does not require a trip to the kernel, does it use a vDSO to place the 
> trampolines (also the libffi.v2.txt link is broken)?
> 
> In any case, do you think it should be fixable by adding trampolines on
> extra mmap memory? 
> 
> [1] https://lkml.org/lkml/2020/9/16/643
> 
>> 
>>>> -  We want to add an auditor to an application at link time, noted in the
>>>>   DT_AUDIT entry of the dynamic section. Loading the DT_AUDIT entry as a
>>>>   program is launched enables our profiler to be injected into an
>>>>   application’s address space without a wrapper script that sets
>>>>   LD_AUDIT and LD_PRELOAD.
>>> 
>>> So if I understood correctly you are asking something like a DT_FILTER
>>> for DT_AUDIT?
>> 
>> Indeed, please let us know if the existing DT_AUDIT support does not
>> work for you.
>> 
>> Thanks,
>> Florian


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-23  6:32                 ` Florian Weimer via Libc-alpha
@ 2021-06-23 20:06                   ` Mark Krentel via Libc-alpha
  0 siblings, 0 replies; 23+ messages in thread
From: Mark Krentel via Libc-alpha @ 2021-06-23 20:06 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-alpha, Xiaozhu Meng, John Mellor-Crummey

Sorry, I was out one day (Tuesday) and I've missed an entire email
thread.  :-)

I agree, we don't want to intercept threads at level of clone().
clone is less portable (args vary by arch type), has variable args,
also mixed in with fork, etc.  That would be much harder and much more
buggy.

thrd_create() is simply a new C11 function that we haven't added yet.
Keep in mind that HPCToolkit goes back well before 2011.

I don't think thrd_create() per se would be a problem.  But I'm not
quite clear on the term "non-iterposable".  Does that mean LD_PRELOAD
is not going to work?  THAT would be a big problem for us.

Let me say in general that for these kinds of things, there is always
a great temptation to go behind the system's back and say, just give
me the "real" thing.  That is simply wrong, wrong, wrong.  There's a
saying in CS that "all problems can be solved by adding an extra layer
of indirection."  That's a bit extreme, but there's a lot of truth to
it.

Imagine a app that wanted to bypass virtual memory because it's doing
low-level gpu programming or whatever and it knows the physical
addresses and wants to use them directly.  Imagine the chaos that
would cause.  That's what MS Windows used to do.

Anyway, there needs to be a clean API line for these things.  If you
bypass the line, then you don't work and play well with others.  And
at some point, all we can do is say, we can't work with your app. :-(

--Mark




> On Jun 23, 2021, at 1:32 AM, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * Adhemerval Zanella:
> 
>> Currently, you need to interpose both pthread_create and thrd_create.  Florian
>> has suggested we allow pthread_create to be interposable (meaning glibc will
>> issue a plt call on each usage). 
>> 
>> We can do it for clone instead, it would have the advantage to hide
>> the multiple architecture different kernel ABIs.
> 
> But the clone call is very low-level.  The start routine is not called
> with a fully configured thread, so wrapping the startup routine would be
> rather awkward.
> 
> The guts of posix_spawn are equally tricky because code is running
> without a properly configured TCB and stack.  I'm not sure if it
> possible to call an interceptable execve from posix_spawn, for instance.
> We would need to know more about interceptor requirements to see if
> there is a solution.
> 
> Thanks,
> Florian
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-06-23 17:42       ` Ben Woodard via Libc-alpha
@ 2021-07-30 14:58         ` Adhemerval Zanella via Libc-alpha
  2021-07-30 18:59           ` Ben Woodard via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-07-30 14:58 UTC (permalink / raw)
  To: Ben Woodard; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha



On 23/06/2021 14:42, Ben Woodard wrote:
> 
> 
>> On Jun 17, 2021, at 4:06 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 17/06/2021 17:09, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>>
>>>>
>>>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>>>  STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
> 
> To me this sounds like partly a toolchain issue. The aarch64 PCS does define the ABI for SVE calls. I haven’t checked what GCC/binutils does in quite a while. It seems like the  STO_AARCH64_VARIANT_PCS was an expedient for SVE when it first came out and the semantics that it defines where all the registers are caller preserved makes it very difficult to implement around.
> 
> For LAV_CURRENT=2 what I planned to do was:

Now that I have finished the various audit issues that John Mellor-Crummey 
has brought up, I think I have a better idea of how to address it.

> 
> diff --git a/sysdeps/aarch64/bits/link.h b/sysdeps/aarch64/bits/link.h
> index ca76087ee1..390b12a826 100644
> --- a/sysdeps/aarch64/bits/link.h
> +++ b/sysdeps/aarch64/bits/link.h
> @@ -20,13 +20,24 @@
>  # error "Never include <bits/link.h> directly; use <link.h> instead."
>  #endif
>  
> +typedef struct La_sve_regs {
> +  uint16_t    *lr_preg[3];
> +  __uint128_t *lr_zreg[8];
> +} La_sve_regs;
> +
>  /* Registers for entry into PLT on AArch64.  */
>  typedef struct La_aarch64_regs
>  {
>    uint64_t    lr_xreg[9];
> -  __uint128_t lr_vreg[8];
>    uint64_t    lr_sp;
>    uint64_t    lr_lr;
> +  char    lr_sve; /* 0 - no SVE,
> +                        1-16 length of the SVE registers in vq (128 bits) */
> +  union {
> +    /* when sve!=0 accessing the lr_vreg is undefined */
> +    __uint128_t lr_vreg[8];
> +    La_sve_regs lr_zreg;
> +  };
>  } La_aarch64_regs;
>  
>  /* Return values for calls from PLT on AArch64.  */
> @@ -34,9 +45,14 @@ typedef struct La_aarch64_retval
>  {
>    /* Up to eight integer registers can be used for a return value.  */
>    uint64_t    lrv_xreg[8];
> -  /* Up to eight V registers can be used for a return value.  */
> -  __uint128_t lrv_vreg[8];
> -
> +  char        lrv_sve; /* 0 - no SVE,
> +                         1-16 length of the SVE registers in vq (128 bits) */
> +  union{
> +    /* Up to eight V registers can be used for a return value.
> +       When sve!=0 accessing the lr_vreg is undefined */
> +    __uint128_t lrv_vreg[8];
> +    La_sve_regs lrv_zreg;
> +  };
>  } La_aarch64_retval;
>  __BEGIN_DECLS

My idea is to do something similar:
---
typedef struct La_sve_regs
{
  uint16_t    *lr_preg;
  long double *lr_zreg;
} La_sve_regs;

typedef struct La_aarch64_regs
{
  [...]
  uint8_t  lr_sve;           /* 0 - no SVE
                                1-16 length of the SVE registers in vq (128 bits)  */
  La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
};

typedef struct La_aarch64_retval
{
  [...]
  uint8_t  lr_sve;           /* 0 - no SVE
                                1-16 length of the SVE registers in vq (128 bits)  */
  La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
}
---

The _dl_runtime_resolve will be responsible to allocate on the stack the required
space for the La_sve_regs and setup the La_aarch64_regs and La_aarch64_retval internal
pointers.  It has the advantage of allocate only the required space and if the we 
can distinguish if the symbol does use SVE we can avoid the performance issue for 
symbols that do not use SVE. The downside is it would require a potential large stack 
space.

>  
> However, that would require toolchain support and another hint in st_other which separates SVE calls from other uses of STO_AARCH64_VARIANT_PCS like STO_AARCH64_VARIANT_SVE. Then the runtime linker could populate the lrv_sve with information from the lrv_vreg with the size of the vector registers from the processor’s registers. 

I think it should be feasible to assume now that STO_AARCH64_VARIANT_PCS 
means SVE, which meant that we can use it _dl_runtime_resolve to skip
the save/restore if the symbol follows the default standard procedure
call.  

> 
> There are at least two problems with that approach. 
> 1) who allocates the lr_zreg pointers in the la_sve_regs and how long should they be? Do they always have to be allocated to be the max size 2048 bits?
> 2) I hadn’t worked out how to handle functions that return things in the SVE regs. Do we need two new flags in st_other? STO_AARCH64_VARIANT_SVE and STO_AARCH64_VARIANT_SVERET?

I don't think we need the two extra flags to handle SVE calls, but I am
also assuming that STO_AARCH64_VARIANT_PCS will be used solely for SVE.
And it has the extra problem of using two extra flags is not backward
compatible.

I presume that if ARM wants to push for another procedure call variant
on linux-gnu I would expect another flag.

> 
> Then there was the question in the future could there be: big.LITTLE processors where some big processors had SVE registers of one length while the LITTLE processors had different ones? 

Although I find this kind of setup unlikely, my expectation is either that it
would be transparent to userland (either kernel will emulate the required 
instructions or it will bind process with STO_AARCH64_VARIANT_PCS to cores 
with SVE).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-07-30 14:58         ` Adhemerval Zanella via Libc-alpha
@ 2021-07-30 18:59           ` Ben Woodard via Libc-alpha
  2021-07-30 21:09             ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Woodard via Libc-alpha @ 2021-07-30 18:59 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha



> On Jul 30, 2021, at 7:58 AM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
> 
> 
> On 23/06/2021 14:42, Ben Woodard wrote:
>> 
>> 
>>> On Jun 17, 2021, at 4:06 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>> 
>>> 
>>> 
>>> On 17/06/2021 17:09, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>> 
>>>>> 
>>>>> 
>>>>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>>>> STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>> 
>> To me this sounds like partly a toolchain issue. The aarch64 PCS does define the ABI for SVE calls. I haven’t checked what GCC/binutils does in quite a while. It seems like the  STO_AARCH64_VARIANT_PCS was an expedient for SVE when it first came out and the semantics that it defines where all the registers are caller preserved makes it very difficult to implement around.
>> 
>> For LAV_CURRENT=2 what I planned to do was:
> 
> Now that I have finished the various audit issues that John Mellor-Crummey 
> has brought up, I think I have a better idea of how to address it.

Well the tier1 ones at least. Thank you. I was working on modifying John’s reproducers and making them official glibc tests so that when your  patch series lands in 2.35 they would have tests that ensure that no regressions show up.

I might not have seen it yet? Did you have a new version of 2/6 or a V3 version of the patch set which handles the lazy binding issue that you pointed out in. https://sourceware.org/pipermail/libc-alpha/2021-July/129510.html 

> 
>> 
>> diff --git a/sysdeps/aarch64/bits/link.h b/sysdeps/aarch64/bits/link.h
>> index ca76087ee1..390b12a826 100644
>> --- a/sysdeps/aarch64/bits/link.h
>> +++ b/sysdeps/aarch64/bits/link.h
>> @@ -20,13 +20,24 @@
>> # error "Never include <bits/link.h> directly; use <link.h> instead."
>> #endif
>> 
>> +typedef struct La_sve_regs {
>> +  uint16_t    *lr_preg[3];
>> +  __uint128_t *lr_zreg[8];
>> +} La_sve_regs;
>> +
>> /* Registers for entry into PLT on AArch64.  */
>> typedef struct La_aarch64_regs
>> {
>>   uint64_t    lr_xreg[9];
>> -  __uint128_t lr_vreg[8];
>>   uint64_t    lr_sp;
>>   uint64_t    lr_lr;
>> +  char    lr_sve; /* 0 - no SVE,
>> +                        1-16 length of the SVE registers in vq (128 bits) */
>> +  union {
>> +    /* when sve!=0 accessing the lr_vreg is undefined */
>> +    __uint128_t lr_vreg[8];
>> +    La_sve_regs lr_zreg;
>> +  };
>> } La_aarch64_regs;
>> 
>> /* Return values for calls from PLT on AArch64.  */
>> @@ -34,9 +45,14 @@ typedef struct La_aarch64_retval
>> {
>>   /* Up to eight integer registers can be used for a return value.  */
>>   uint64_t    lrv_xreg[8];
>> -  /* Up to eight V registers can be used for a return value.  */
>> -  __uint128_t lrv_vreg[8];
>> -
>> +  char        lrv_sve; /* 0 - no SVE,
>> +                         1-16 length of the SVE registers in vq (128 bits) */
>> +  union{
>> +    /* Up to eight V registers can be used for a return value.
>> +       When sve!=0 accessing the lr_vreg is undefined */
>> +    __uint128_t lrv_vreg[8];
>> +    La_sve_regs lrv_zreg;
>> +  };
>> } La_aarch64_retval;
>> __BEGIN_DECLS
> 
> My idea is to do something similar:
> ---
> typedef struct La_sve_regs
> {
>  uint16_t    *lr_preg;
>  long double *lr_zreg;
> } La_sve_regs;
> 
> typedef struct La_aarch64_regs
> {
>  [...]
>  uint8_t  lr_sve;           /* 0 - no SVE
>                                1-16 length of the SVE registers in vq (128 bits)  */
>  La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
> };

Take this with a grain of salt you are the expert not me. These are three minor issues that I don’t really like about this approach:
1) I think having lr_sve be an uint8_t will mess up the alignment of the V registers that are immediately after it. It probably needs to be at the end of the structure or it needs to be padded out for alignment.
2) Second it appears like you have to traverse two pointers to actually get at the registers. In my opinion, it would be better if everything were just offsets from the beginning of the structure so that the actual memory for the Z and P registers is immediately following the La_aarch64_regs structure.
3) Don’t the V and the Z registers overlap in the machine architecture? Don’t we want to in some way overlap them so that you don’t have unused V register space in the structures when it is a SVE call?

What I would prefer to see in memory is:

  uint64_t lr_xreg[8];
  uint64_t lr_sp;
  uint64_t lr_lr;
  uint64_t  lr_sve;  // increased in size for alignment

then when lr_sve == 0
  long double lr_vreg[8];
or
  long double lr_zreg[8][lr_sve]
  uint16_t preg[4][lr_sve]
 
or something like that where the V or Z & P registers are immediately adjacent in memory to the rest of the normal registers. You don’t have to traverse pointers to get to them.

Could we have two la_aarch64_gnu_pltenter/exit functions? One which passes the normal La_aarch64_regs and one that passes la_aarch64_sveregs?

or would it be better to have two types, the normal version and then when lr_sve!=0 then you cast it to: La_aarch64_sve_regs which has the Z and the P regs instead.


> 
> typedef struct La_aarch64_retval
> {
>  [...]
>  uint8_t  lr_sve;           /* 0 - no SVE
>                                1-16 length of the SVE registers in vq (128 bits)  */
>  La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
> }
> ---
> 
> The _dl_runtime_resolve will be responsible to allocate on the stack the required
> space for the La_sve_regs and setup the La_aarch64_regs and La_aarch64_retval internal
> pointers.  It has the advantage of allocate only the required space and if the we 
> can distinguish if the symbol does use SVE we can avoid the performance issue for 
> symbols that do not use SVE. The downside is it would require a potential large stack 
> space.
> 
>> 
>> However, that would require toolchain support and another hint in st_other which separates SVE calls from other uses of STO_AARCH64_VARIANT_PCS like STO_AARCH64_VARIANT_SVE. Then the runtime linker could populate the lrv_sve with information from the lrv_vreg with the size of the vector registers from the processor’s registers. 
> 
> I think it should be feasible to assume now that STO_AARCH64_VARIANT_PCS 
> means SVE, which meant that we can use it _dl_runtime_resolve to skip
> the save/restore if the symbol follows the default standard procedure
> call.  

Isn’t STO_AARCH64_VARIANT_PCS used for other kinds of calls as well, ones that require all the registers to be preserved.
I know that it would require some toolchain changes but shouldn’t we define something like STO_AARCH64_VARIANT_SVE to differentiate the two.

> 
>> 
>> There are at least two problems with that approach. 
>> 1) who allocates the lr_zreg pointers in the la_sve_regs and how long should they be? Do they always have to be allocated to be the max size 2048 bits?
>> 2) I hadn’t worked out how to handle functions that return things in the SVE regs. Do we need two new flags in st_other? STO_AARCH64_VARIANT_SVE and STO_AARCH64_VARIANT_SVERET?
> 
> I don't think we need the two extra flags to handle SVE calls, but I am
> also assuming that STO_AARCH64_VARIANT_PCS will be used solely for SVE.

Can you double check with your ARM contacts regarding that? I’m fairly sure that someone either in RH, ARM, or Linaro told me that VARIANT_PCS was also used other unusual situations other than SVE.

> And it has the extra problem of using two extra flags is not backward
> compatible.
> 
> I presume that if ARM wants to push for another procedure call variant
> on linux-gnu I would expect another flag.
> 
>> 
>> Then there was the question in the future could there be: big.LITTLE processors where some big processors had SVE registers of one length while the LITTLE processors had different ones? 
> 
> Although I find this kind of setup unlikely, my expectation is either that it
> would be transparent to userland (either kernel will emulate the required 
> instructions or it will bind process with STO_AARCH64_VARIANT_PCS to cores 
> with SVE).

Can you explain how that would work?

You launch a binary that has SVE from any processor. Somewhere along the way this means one of the exec syscalls handles it in the kernel. The kernel notices that it is an ELF file and hands it off to the ELF interpreter ld.so. Nowhere along the way would the kernel do a deep enough inspection of the binary to notice that it has SVE instructions in it and so how would it know to bind it to a processor which has HW SVE?

I also don’t see any places where ld.so can call back into the kernel and say “uh move me over to a processor that has these capabilities”. 

What I can imagine is a LITTLE processor traps into the kernel when executing a SVE instruction just like any illegal instruction. However, the handler looks at the instruction being executed notices it is an SVE instruction and instead of delivering a SIGILL to the process, it sets a flag in the task structure that says that this process uses SVE and migrates it to one of the big processors that can execute SVE.

-ben


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-07-30 18:59           ` Ben Woodard via Libc-alpha
@ 2021-07-30 21:09             ` Adhemerval Zanella via Libc-alpha
  2021-07-31  0:59               ` Ben Woodard via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-07-30 21:09 UTC (permalink / raw)
  To: Ben Woodard; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha



On 30/07/2021 15:59, Ben Woodard wrote:
> 
> 
>> On Jul 30, 2021, at 7:58 AM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>
>>
>>
>> On 23/06/2021 14:42, Ben Woodard wrote:
>>>
>>>
>>>> On Jun 17, 2021, at 4:06 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>>>
>>>>
>>>>
>>>> On 17/06/2021 17:09, Florian Weimer wrote:
>>>>> * Adhemerval Zanella:
>>>>>
>>>>>>
>>>>>>
>>>>>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>>>>> STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>>>
>>> To me this sounds like partly a toolchain issue. The aarch64 PCS does define the ABI for SVE calls. I haven’t checked what GCC/binutils does in quite a while. It seems like the  STO_AARCH64_VARIANT_PCS was an expedient for SVE when it first came out and the semantics that it defines where all the registers are caller preserved makes it very difficult to implement around.
>>>
>>> For LAV_CURRENT=2 what I planned to do was:
>>
>> Now that I have finished the various audit issues that John Mellor-Crummey 
>> has brought up, I think I have a better idea of how to address it.
> 
> Well the tier1 ones at least. Thank you. I was working on modifying John’s reproducers and making them official glibc tests so that when your  patch series lands in 2.35 they would have tests that ensure that no regressions show up.
> 
> I might not have seen it yet? Did you have a new version of 2/6 or a V3 version of the patch set which handles the lazy binding issue that you pointed out in. https://sourceware.org/pipermail/libc-alpha/2021-July/129510.html 

I sent it [1].  I changed your aarch64 patch slightly, fixed a small issue, and
added a couple of tests.  It should fixed all tier1 and mostly of tier2 issues,
modulo the aarch64 SVE one.

[1] https://patchwork.sourceware.org/project/glibc/list/?series=2577

> 
>>
>>>
>>> diff --git a/sysdeps/aarch64/bits/link.h b/sysdeps/aarch64/bits/link.h
>>> index ca76087ee1..390b12a826 100644
>>> --- a/sysdeps/aarch64/bits/link.h
>>> +++ b/sysdeps/aarch64/bits/link.h
>>> @@ -20,13 +20,24 @@
>>> # error "Never include <bits/link.h> directly; use <link.h> instead."
>>> #endif
>>>
>>> +typedef struct La_sve_regs {
>>> +  uint16_t    *lr_preg[3];
>>> +  __uint128_t *lr_zreg[8];
>>> +} La_sve_regs;
>>> +
>>> /* Registers for entry into PLT on AArch64.  */
>>> typedef struct La_aarch64_regs
>>> {
>>>   uint64_t    lr_xreg[9];
>>> -  __uint128_t lr_vreg[8];
>>>   uint64_t    lr_sp;
>>>   uint64_t    lr_lr;
>>> +  char    lr_sve; /* 0 - no SVE,
>>> +                        1-16 length of the SVE registers in vq (128 bits) */
>>> +  union {
>>> +    /* when sve!=0 accessing the lr_vreg is undefined */
>>> +    __uint128_t lr_vreg[8];
>>> +    La_sve_regs lr_zreg;
>>> +  };
>>> } La_aarch64_regs;
>>>
>>> /* Return values for calls from PLT on AArch64.  */
>>> @@ -34,9 +45,14 @@ typedef struct La_aarch64_retval
>>> {
>>>   /* Up to eight integer registers can be used for a return value.  */
>>>   uint64_t    lrv_xreg[8];
>>> -  /* Up to eight V registers can be used for a return value.  */
>>> -  __uint128_t lrv_vreg[8];
>>> -
>>> +  char        lrv_sve; /* 0 - no SVE,
>>> +                         1-16 length of the SVE registers in vq (128 bits) */
>>> +  union{
>>> +    /* Up to eight V registers can be used for a return value.
>>> +       When sve!=0 accessing the lr_vreg is undefined */
>>> +    __uint128_t lrv_vreg[8];
>>> +    La_sve_regs lrv_zreg;
>>> +  };
>>> } La_aarch64_retval;
>>> __BEGIN_DECLS
>>
>> My idea is to do something similar:
>> ---
>> typedef struct La_sve_regs
>> {
>>  uint16_t    *lr_preg;
>>  long double *lr_zreg;
>> } La_sve_regs;
>>
>> typedef struct La_aarch64_regs
>> {
>>  [...]
>>  uint8_t  lr_sve;           /* 0 - no SVE
>>                                1-16 length of the SVE registers in vq (128 bits)  */
>>  La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
>> };
> 
> Take this with a grain of salt you are the expert not me. These are three minor issues that I don’t really like about this approach:
> 1) I think having lr_sve be an uint8_t will mess up the alignment of the V registers that are immediately after it. It probably needs to be at the end of the structure or it needs to be padded out for alignment.

No sure if there will be alignment constraint here, since SVE instruction afaik
can load/store unaligned memory and this is not a performance critical code. 
In any case, that's why I added 'long double' for the 'lz_zreg' since it should
at least set 16 bytes alignment.

> 2) Second it appears like you have to traverse two pointers to actually get at the registers. In my opinion, it would be better if everything were just offsets from the beginning of the structure so that the actual memory for the Z and P registers is immediately following the La_aarch64_regs structure.

It is a trade-off to avoid add minimize the struct space to non-SVE code.  But
I don't have a strong preference, we can use this instead:

typedef struct La_aarch64_regs
{
  [...]
  uint8_t  lr_sve;           /* 0 - no SVE
                                1-16 length of the SVE registers in vq (128 bits)  */
  uint16_t    *lr_preg[8];   /* NULL - no SVE.  */
  long double *lr_zreg[4];   /* NULL - no SVE.  */
};

> 3) Don’t the V and the Z registers overlap in the machine architecture? Don’t we want to in some way overlap them so that you don’t have unused V register space in the structures when it is a SVE call?
> 
> What I would prefer to see in memory is:
> 
>   uint64_t lr_xreg[8];
>   uint64_t lr_sp;
>   uint64_t lr_lr;
>   uint64_t  lr_sve;  // increased in size for alignment
> 
> then when lr_sve == 0
>   long double lr_vreg[8];
> or
>   long double lr_zreg[8][lr_sve]
>   uint16_t preg[4][lr_sve]
>  
> or something like that where the V or Z & P registers are immediately adjacent in memory to the rest of the normal registers. You don’t have to traverse pointers to get to them.

We can, but my idea is to make non SVE calls to have no performance or
space implications.  Do we really need to optimize to avoid pointer
transverse? What kind of analysis the audit module usually do?

I am asking because I foresee that any pointer transverse will only be
a fraction of time of the PLT hook runtime itself. 

> 
> Could we have two la_aarch64_gnu_pltenter/exit functions? One which passes the normal La_aarch64_regs and one that passes la_aarch64_sveregs?

It might be an option, but it would add another complexity layer:

  - We would need to add arch-specific hooks on the audit internal 'auditstate'
    interface.

  - We would need to add arch-specific loader code to handle the SVE la_pltenter
    la_pltexit

  - It would require either a different implementation for _dl_runtime_resolve
    to call the correct pltexit hook or a add a way to decide which hook to
    call at runtime.  x86_64 has decided for the former and it sets the expected
    trampoline based on the cpuid() (it would be hwcap2 for aarch64).

I am not sure which would be better, maybe adding SVE specific hooks might be
a better approach indeed (although it would required more code).

> 
> or would it be better to have two types, the normal version and then when lr_sve!=0 then you cast it to: La_aarch64_sve_regs which has the Z and the P regs instead.
> 
> 
>>
>> typedef struct La_aarch64_retval
>> {
>>  [...]
>>  uint8_t  lr_sve;           /* 0 - no SVE
>>                                1-16 length of the SVE registers in vq (128 bits)  */
>>  La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
>> }
>> ---
>>
>> The _dl_runtime_resolve will be responsible to allocate on the stack the required
>> space for the La_sve_regs and setup the La_aarch64_regs and La_aarch64_retval internal
>> pointers.  It has the advantage of allocate only the required space and if the we 
>> can distinguish if the symbol does use SVE we can avoid the performance issue for 
>> symbols that do not use SVE. The downside is it would require a potential large stack 
>> space.
>>
>>>
>>> However, that would require toolchain support and another hint in st_other which separates SVE calls from other uses of STO_AARCH64_VARIANT_PCS like STO_AARCH64_VARIANT_SVE. Then the runtime linker could populate the lrv_sve with information from the lrv_vreg with the size of the vector registers from the processor’s registers. 
>>
>> I think it should be feasible to assume now that STO_AARCH64_VARIANT_PCS 
>> means SVE, which meant that we can use it _dl_runtime_resolve to skip
>> the save/restore if the symbol follows the default standard procedure
>> call.  
> 
> Isn’t STO_AARCH64_VARIANT_PCS used for other kinds of calls as well, ones that require all the registers to be preserved.
> I know that it would require some toolchain changes but shouldn’t we define something like STO_AARCH64_VARIANT_SVE to differentiate the two.

Ideally yes, but it would require both compiler and binutils support. I am not
sure if ARM is willing to move forward, and even though it would take time.

I suggested assuming STO_AARCH64_VARIANT_PCS means SVE because at least for
profiling PLT stub, I think we can check if hwcap supports SVE and then
assume the symbol is indeed a SVE call. It would work for *current* defined
ABI and ISA, but if ARM starts to use it for something else, specially for
newer architecture that also support SVE, it indeed might break.

> 
>>
>>>
>>> There are at least two problems with that approach. 
>>> 1) who allocates the lr_zreg pointers in the la_sve_regs and how long should they be? Do they always have to be allocated to be the max size 2048 bits?
>>> 2) I hadn’t worked out how to handle functions that return things in the SVE regs. Do we need two new flags in st_other? STO_AARCH64_VARIANT_SVE and STO_AARCH64_VARIANT_SVERET?
>>
>> I don't think we need the two extra flags to handle SVE calls, but I am
>> also assuming that STO_AARCH64_VARIANT_PCS will be used solely for SVE.
> 
> Can you double check with your ARM contacts regarding that? I’m fairly sure that someone either in RH, ARM, or Linaro told me that VARIANT_PCS was also used other unusual situations other than SVE.
> 
>> And it has the extra problem of using two extra flags is not backward
>> compatible.
>>
>> I presume that if ARM wants to push for another procedure call variant
>> on linux-gnu I would expect another flag.
>>
>>>
>>> Then there was the question in the future could there be: big.LITTLE processors where some big processors had SVE registers of one length while the LITTLE processors had different ones? 
>>
>> Although I find this kind of setup unlikely, my expectation is either that it
>> would be transparent to userland (either kernel will emulate the required 
>> instructions or it will bind process with STO_AARCH64_VARIANT_PCS to cores 
>> with SVE).
> 
> Can you explain how that would work?
> 
> You launch a binary that has SVE from any processor. Somewhere along the way this means one of the exec syscalls handles it in the kernel. The kernel notices that it is an ELF file and hands it off to the ELF interpreter ld.so. Nowhere along the way would the kernel do a deep enough inspection of the binary to notice that it has SVE instructions in it and so how would it know to bind it to a processor which has HW SVE?
> 
> I also don’t see any places where ld.so can call back into the kernel and say “uh move me over to a processor that has these capabilities”. 
> 
> What I can imagine is a LITTLE processor traps into the kernel when executing a SVE instruction just like any illegal instruction. However, the handler looks at the instruction being executed notices it is an SVE instruction and instead of delivering a SIGILL to the process, it sets a flag in the task structure that says that this process uses SVE and migrates it to one of the big processors that can execute SVE.

Indeed, the STO_AARCH64_VARIANT_PCS would not help in this case.
But you summarize what I think it should happen: if a core executes
the SVE instruct and traps, the kernel would either move to a SVE
core with pc reset, emulate the instruction, or throw a SIGILL.
In any case, I would expect it to be transparent to userland.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-07-30 21:09             ` Adhemerval Zanella via Libc-alpha
@ 2021-07-31  0:59               ` Ben Woodard via Libc-alpha
  2021-08-04 18:11                 ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Woodard via Libc-alpha @ 2021-07-31  0:59 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha



> On Jul 30, 2021, at 2:09 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
> 
> 
> 
> On 30/07/2021 15:59, Ben Woodard wrote:
>> 
>> 
>>> On Jul 30, 2021, at 7:58 AM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>> 
>>> 
>>> 
>>> On 23/06/2021 14:42, Ben Woodard wrote:
>>>> 
>>>> 
>>>>> On Jun 17, 2021, at 4:06 PM, Adhemerval Zanella <adhemerval.zanella@linaro.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On 17/06/2021 17:09, Florian Weimer wrote:
>>>>>> * Adhemerval Zanella:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> * SVE support: as indicated by Szabolcs SVE calls are marked with the 
>>>>>>> STO_AARCH64_VARIANT_PCS and thus explicit not supported by dynamic loader. 
>>>> 
>>>> To me this sounds like partly a toolchain issue. The aarch64 PCS does define the ABI for SVE calls. I haven’t checked what GCC/binutils does in quite a while. It seems like the  STO_AARCH64_VARIANT_PCS was an expedient for SVE when it first came out and the semantics that it defines where all the registers are caller preserved makes it very difficult to implement around.
>>>> 
>>>> For LAV_CURRENT=2 what I planned to do was:
>>> 
>>> Now that I have finished the various audit issues that John Mellor-Crummey 
>>> has brought up, I think I have a better idea of how to address it.
>> 
>> Well the tier1 ones at least. Thank you. I was working on modifying John’s reproducers and making them official glibc tests so that when your  patch series lands in 2.35 they would have tests that ensure that no regressions show up.
>> 
>> I might not have seen it yet? Did you have a new version of 2/6 or a V3 version of the patch set which handles the lazy binding issue that you pointed out in. https://sourceware.org/pipermail/libc-alpha/2021-July/129510.html 
> 
> I sent it [1].  I changed your aarch64 patch slightly, fixed a small issue, and
> added a couple of tests.  It should fixed all tier1 and mostly of tier2 issues,
> modulo the aarch64 SVE one.
> 
> [1] https://patchwork.sourceware.org/project/glibc/list/?series=2577 <https://patchwork.sourceware.org/project/glibc/list/?series=2577>

Yea!! I’ll test it with my stuff.

> 
>> 
>>> 
>>>> 
>>>> diff --git a/sysdeps/aarch64/bits/link.h b/sysdeps/aarch64/bits/link.h
>>>> index ca76087ee1..390b12a826 100644
>>>> --- a/sysdeps/aarch64/bits/link.h
>>>> +++ b/sysdeps/aarch64/bits/link.h
>>>> @@ -20,13 +20,24 @@
>>>> # error "Never include <bits/link.h> directly; use <link.h> instead."
>>>> #endif
>>>> 
>>>> +typedef struct La_sve_regs {
>>>> +  uint16_t    *lr_preg[3];
>>>> +  __uint128_t *lr_zreg[8];
>>>> +} La_sve_regs;
>>>> +
>>>> /* Registers for entry into PLT on AArch64.  */
>>>> typedef struct La_aarch64_regs
>>>> {
>>>>  uint64_t    lr_xreg[9];
>>>> -  __uint128_t lr_vreg[8];
>>>>  uint64_t    lr_sp;
>>>>  uint64_t    lr_lr;
>>>> +  char    lr_sve; /* 0 - no SVE,
>>>> +                        1-16 length of the SVE registers in vq (128 bits) */
>>>> +  union {
>>>> +    /* when sve!=0 accessing the lr_vreg is undefined */
>>>> +    __uint128_t lr_vreg[8];
>>>> +    La_sve_regs lr_zreg;
>>>> +  };
>>>> } La_aarch64_regs;
>>>> 
>>>> /* Return values for calls from PLT on AArch64.  */
>>>> @@ -34,9 +45,14 @@ typedef struct La_aarch64_retval
>>>> {
>>>>  /* Up to eight integer registers can be used for a return value.  */
>>>>  uint64_t    lrv_xreg[8];
>>>> -  /* Up to eight V registers can be used for a return value.  */
>>>> -  __uint128_t lrv_vreg[8];
>>>> -
>>>> +  char        lrv_sve; /* 0 - no SVE,
>>>> +                         1-16 length of the SVE registers in vq (128 bits) */
>>>> +  union{
>>>> +    /* Up to eight V registers can be used for a return value.
>>>> +       When sve!=0 accessing the lr_vreg is undefined */
>>>> +    __uint128_t lrv_vreg[8];
>>>> +    La_sve_regs lrv_zreg;
>>>> +  };
>>>> } La_aarch64_retval;
>>>> __BEGIN_DECLS
>>> 
>>> My idea is to do something similar:
>>> ---
>>> typedef struct La_sve_regs
>>> {
>>> uint16_t    *lr_preg;
>>> long double *lr_zreg;
>>> } La_sve_regs;
>>> 
>>> typedef struct La_aarch64_regs
>>> {
>>> [...]
>>> uint8_t  lr_sve;           /* 0 - no SVE
>>>                               1-16 length of the SVE registers in vq (128 bits)  */
>>> La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
>>> };
>> 
>> Take this with a grain of salt you are the expert not me. These are three minor issues that I don’t really like about this approach:
>> 1) I think having lr_sve be an uint8_t will mess up the alignment of the V registers that are immediately after it. It probably needs to be at the end of the structure or it needs to be padded out for alignment.
> 
> No sure if there will be alignment constraint here, since SVE instruction afaik
> can load/store unaligned memory and this is not a performance critical code. 
> In any case, that's why I added 'long double' for the 'lz_zreg' since it should
> at least set 16 bytes alignment.
> 

I wasn’t really worried about the Z registers here. I was thinking about the V registers. I tried fixing this myself a few times and one of my attempts looked like yours with a char  lr_sve and in that case I ran into alignment problems. The difference between what you are doing and what I tried is that I moved what are now the D registers (which is broken - you need to save the full 128b V register — this was copied over from V7 and wasn’t updated for 64b V8) down in the structure to after lr_sve and that is what caused the alignment problem in the assembly portion of the code.

typedef struct La_sve_regs {
  uint16_t    *lr_preg[3];
  __uint128_t *lr_zreg[8];
} La_sve_regs;

/* Registers for entry into PLT on AArch64.  */
typedef struct La_aarch64_regs
{
    uint64_t    lr_xreg[9];
    uint64_t    lr_sp;
    uint64_t    lr_lr;
=>  char    lr_sve; /* 0 - no SVE, 
                         1-16 length of the SVE registers in vq (128 bits) */

=> // Note that this is *BEFORE* the V registers 
  union {
    /* when sve!=0 accessing the lr_vreg is undefined */
    __uint128_t lr_vreg[8];
    La_sve_regs lr_zreg;
  };
} La_aarch64_regs;

>> 2) Second it appears like you have to traverse two pointers to actually get at the registers. In my opinion, it would be better if everything were just offsets from the beginning of the structure so that the actual memory for the Z and P registers is immediately following the La_aarch64_regs structure.
> 
> It is a trade-off to avoid add minimize the struct space to non-SVE code.  But
> I don't have a strong preference, we can use this instead:
> 
> typedef struct La_aarch64_regs
> {
>  [...]
>  uint8_t  lr_sve;           /* 0 - no SVE
>                                1-16 length of the SVE registers in vq (128 bits)  */
>  uint16_t    *lr_preg[8];   /* NULL - no SVE.  */
>  long double *lr_zreg[4];   /* NULL - no SVE.  */
> };
> 

You are correct, This isn’t a performance critical code path. If you are watching the pltenter/exit things are going to suck anyway.  I was trying to be too clever. I was sort of thinking: Whose going to allocate the memory for this structure - how do we make it thread safe how are we going to deal with all 100 threads coming through here all at once. So I was thinking let’s keep the whole register structure on the stack. 

With your idea my best guess is that the lr_sve_regs would be allocated immediately after the La_aarch64_regs structure on the stack and so in memory it would look like:

  uint64_t    lr_xreg[9];
  __uint128_t lr_vreg[8];
  uint64_t    lr_sp;
  uint64_t    lr_lr;
  uint8_t;    lr_sve; 
  La_sve_regs *lr_sve_regs; -> SVE_REGS
SVE_REGS:
   uint16_t    *lr_preg; -> SVE_P_REGS
   long double *lr_zreg; -> SVE_P_REGS+3*lr_sve*sizeof(uint16_t)
SVE_P_REGS;
   unit16_t lr_preg[3*lr_sve]; 
   long double lr_zreg[8*lr_sve];

If my guess about where the memory was going to be allocated is correct, then you have a bunch or pointer arithmetic for things which going to be right there anyway at static offsets.

Really it doesn’t matter, we can do it anyway you like. We just need it to work. Your way makes sense to a normal C programmer. I was trying to come up with something more clever that would take up less space which wouldn’t have all the pointer following. However, I couldn’t get it working.

>> 3) Don’t the V and the Z registers overlap in the machine architecture? Don’t we want to in some way overlap them so that you don’t have unused V register space in the structures when it is a SVE call?
>> 
>> What I would prefer to see in memory is:
>> 
>>  uint64_t lr_xreg[8];
>>  uint64_t lr_sp;
>>  uint64_t lr_lr;
>>  uint64_t  lr_sve;  // increased in size for alignment
>> 
>> then when lr_sve == 0
>>  long double lr_vreg[8];
>> or
>>  long double lr_zreg[8][lr_sve]
>>  uint16_t preg[4][lr_sve]
>> 
>> or something like that where the V or Z & P registers are immediately adjacent in memory to the rest of the normal registers. You don’t have to traverse pointers to get to them.
> 
> We can, but my idea is to make non SVE calls to have no performance or
> space implications.  Do we really need to optimize to avoid pointer
> transverse? What kind of analysis the audit module usually do?
> 
> I am asking because I foresee that any pointer transverse will only be
> a fraction of time of the PLT hook runtime itself. 

Yep that is correct.

> 
>> 
>> Could we have two la_aarch64_gnu_pltenter/exit functions? One which passes the normal La_aarch64_regs and one that passes la_aarch64_sveregs?
> 
> It might be an option, but it would add another complexity layer:
> 
>  - We would need to add arch-specific hooks on the audit internal 'auditstate'
>    interface.
> 
>  - We would need to add arch-specific loader code to handle the SVE la_pltenter
>    la_pltexit
> 
>  - It would require either a different implementation for _dl_runtime_resolve
>    to call the correct pltexit hook or a add a way to decide which hook to
>    call at runtime.  x86_64 has decided for the former and it sets the expected
>    trampoline based on the cpuid() (it would be hwcap2 for aarch64).
> 
> I am not sure which would be better, maybe adding SVE specific hooks might be
> a better approach indeed (although it would required more code).
> 
>> 
>> or would it be better to have two types, the normal version and then when lr_sve!=0 then you cast it to: La_aarch64_sve_regs which has the Z and the P regs instead.
>> 
>> 
>>> 
>>> typedef struct La_aarch64_retval
>>> {
>>> [...]
>>> uint8_t  lr_sve;           /* 0 - no SVE
>>>                               1-16 length of the SVE registers in vq (128 bits)  */
>>> La_sve_regs *lr_sve_regs;  /* NULL - no SVE.  */
>>> }
>>> ---
>>> 
>>> The _dl_runtime_resolve will be responsible to allocate on the stack the required
>>> space for the La_sve_regs and setup the La_aarch64_regs and La_aarch64_retval internal
>>> pointers.  It has the advantage of allocate only the required space and if the we 
>>> can distinguish if the symbol does use SVE we can avoid the performance issue for 
>>> symbols that do not use SVE. The downside is it would require a potential large stack 
>>> space.
>>> 
>>>> 
>>>> However, that would require toolchain support and another hint in st_other which separates SVE calls from other uses of STO_AARCH64_VARIANT_PCS like STO_AARCH64_VARIANT_SVE. Then the runtime linker could populate the lrv_sve with information from the lrv_vreg with the size of the vector registers from the processor’s registers. 
>>> 
>>> I think it should be feasible to assume now that STO_AARCH64_VARIANT_PCS 
>>> means SVE, which meant that we can use it _dl_runtime_resolve to skip
>>> the save/restore if the symbol follows the default standard procedure
>>> call.  
>> 
>> Isn’t STO_AARCH64_VARIANT_PCS used for other kinds of calls as well, ones that require all the registers to be preserved.
>> I know that it would require some toolchain changes but shouldn’t we define something like STO_AARCH64_VARIANT_SVE to differentiate the two.
> 
> Ideally yes, but it would require both compiler and binutils support. I am not
> sure if ARM is willing to move forward, and even though it would take time.
> 
> I suggested assuming STO_AARCH64_VARIANT_PCS means SVE because at least for
> profiling PLT stub, I think we can check if hwcap supports SVE and then
> assume the symbol is indeed a SVE call. It would work for *current* defined
> ABI and ISA, but if ARM starts to use it for something else, specially for
> newer architecture that also support SVE, it indeed might break.

Please double check with someone more informed than I am, I’m almost certain someone told me that it is used somewhere else for something else.

-ben

> 
>> 
>>> 
>>>> 
>>>> There are at least two problems with that approach. 
>>>> 1) who allocates the lr_zreg pointers in the la_sve_regs and how long should they be? Do they always have to be allocated to be the max size 2048 bits?
>>>> 2) I hadn’t worked out how to handle functions that return things in the SVE regs. Do we need two new flags in st_other? STO_AARCH64_VARIANT_SVE and STO_AARCH64_VARIANT_SVERET?
>>> 
>>> I don't think we need the two extra flags to handle SVE calls, but I am
>>> also assuming that STO_AARCH64_VARIANT_PCS will be used solely for SVE.
>> 
>> Can you double check with your ARM contacts regarding that? I’m fairly sure that someone either in RH, ARM, or Linaro told me that VARIANT_PCS was also used other unusual situations other than SVE.
>> 
>>> And it has the extra problem of using two extra flags is not backward
>>> compatible.
>>> 
>>> I presume that if ARM wants to push for another procedure call variant
>>> on linux-gnu I would expect another flag.
>>> 
>>>> 
>>>> Then there was the question in the future could there be: big.LITTLE processors where some big processors had SVE registers of one length while the LITTLE processors had different ones? 
>>> 
>>> Although I find this kind of setup unlikely, my expectation is either that it
>>> would be transparent to userland (either kernel will emulate the required 
>>> instructions or it will bind process with STO_AARCH64_VARIANT_PCS to cores 
>>> with SVE).
>> 
>> Can you explain how that would work?
>> 
>> You launch a binary that has SVE from any processor. Somewhere along the way this means one of the exec syscalls handles it in the kernel. The kernel notices that it is an ELF file and hands it off to the ELF interpreter ld.so. Nowhere along the way would the kernel do a deep enough inspection of the binary to notice that it has SVE instructions in it and so how would it know to bind it to a processor which has HW SVE?
>> 
>> I also don’t see any places where ld.so can call back into the kernel and say “uh move me over to a processor that has these capabilities”. 
>> 
>> What I can imagine is a LITTLE processor traps into the kernel when executing a SVE instruction just like any illegal instruction. However, the handler looks at the instruction being executed notices it is an SVE instruction and instead of delivering a SIGILL to the process, it sets a flag in the task structure that says that this process uses SVE and migrates it to one of the big processors that can execute SVE.
> 
> Indeed, the STO_AARCH64_VARIANT_PCS would not help in this case.
> But you summarize what I think it should happen: if a core executes
> the SVE instruct and traps, the kernel would either move to a SVE
> core with pc reset, emulate the instruction, or throw a SIGILL.
> In any case, I would expect it to be transparent to userland.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-07-31  0:59               ` Ben Woodard via Libc-alpha
@ 2021-08-04 18:11                 ` Adhemerval Zanella via Libc-alpha
  2021-08-05 10:32                   ` Szabolcs Nagy via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-08-04 18:11 UTC (permalink / raw)
  To: Ben Woodard
  Cc: Florian Weimer, Szabolcs Nagy, John Mellor-Crummey, libc-alpha



On 30/07/2021 21:59, Ben Woodard wrote:
>>>
>>> Take this with a grain of salt you are the expert not me. These are three minor issues that I don’t really like about this approach:
>>> 1) I think having lr_sve be an uint8_t will mess up the alignment of the V registers that are immediately after it. It probably needs to be at the end of the structure or it needs to be padded out for alignment.
>>
>> No sure if there will be alignment constraint here, since SVE instruction afaik
>> can load/store unaligned memory and this is not a performance critical code. 
>> In any case, that's why I added 'long double' for the 'lz_zreg' since it should
>> at least set 16 bytes alignment.
>>
> 
> I wasn’t really worried about the Z registers here. I was thinking about the V registers. I tried fixing this myself a few times and one of my attempts looked like yours with a char  lr_sve and in that case I ran into alignment problems. The difference between what you are doing and what I tried is that I moved what are now the D registers (which is broken - you need to save the full 128b V register — this was copied over from V7 and wasn’t updated for 64b V8) down in the structure to after lr_sve and that is what caused the alignment problem in the assembly portion of the code.
> 
> typedef struct La_sve_regs {
>   uint16_t    *lr_preg[3];
>   __uint128_t *lr_zreg[8];
> } La_sve_regs;
> 
> /* Registers for entry into PLT on AArch64.  */
> typedef struct La_aarch64_regs
> {
>     uint64_t    lr_xreg[9];
>     uint64_t    lr_sp;
>     uint64_t    lr_lr;
> =>  char    lr_sve; /* 0 - no SVE, 
>                          1-16 length of the SVE registers in vq (128 bits) */
> 
> => // Note that this is *BEFORE* the V registers 
>   union {
>     /* when sve!=0 accessing the lr_vreg is undefined */
>     __uint128_t lr_vreg[8];
>     La_sve_regs lr_zreg;
>   };
> } La_aarch64_regs;
> 
>>> 2) Second it appears like you have to traverse two pointers to actually get at the registers. In my opinion, it would be better if everything were just offsets from the beginning of the structure so that the actual memory for the Z and P registers is immediately following the La_aarch64_regs structure.
>>
>> It is a trade-off to avoid add minimize the struct space to non-SVE code.  But
>> I don't have a strong preference, we can use this instead:
>>
>> typedef struct La_aarch64_regs
>> {
>>  [...]
>>  uint8_t  lr_sve;           /* 0 - no SVE
>>                                1-16 length of the SVE registers in vq (128 bits)  */
>>  uint16_t    *lr_preg[8];   /* NULL - no SVE.  */
>>  long double *lr_zreg[4];   /* NULL - no SVE.  */
>> };
>>
> 
> You are correct, This isn’t a performance critical code path. If you are watching the pltenter/exit things are going to suck anyway.  I was trying to be too clever. I was sort of thinking: Whose going to allocate the memory for this structure - how do we make it thread safe how are we going to deal with all 100 threads coming through here all at once. So I was thinking let’s keep the whole register structure on the stack. 
> 
> With your idea my best guess is that the lr_sve_regs would be allocated immediately after the La_aarch64_regs structure on the stack and so in memory it would look like:
> 
>   uint64_t    lr_xreg[9];
>   __uint128_t lr_vreg[8];
>   uint64_t    lr_sp;
>   uint64_t    lr_lr;
>   uint8_t;    lr_sve; 
>   La_sve_regs *lr_sve_regs; -> SVE_REGS
> SVE_REGS:
>    uint16_t    *lr_preg; -> SVE_P_REGS
>    long double *lr_zreg; -> SVE_P_REGS+3*lr_sve*sizeof(uint16_t)
> SVE_P_REGS;
>    unit16_t lr_preg[3*lr_sve]; 
>    long double lr_zreg[8*lr_sve];
> 
> If my guess about where the memory was going to be allocated is correct, then you have a bunch or pointer arithmetic for things which going to be right there anyway at static offsets.
> 
> Really it doesn’t matter, we can do it anyway you like. We just need it to work. Your way makes sense to a normal C programmer. I was trying to come up with something more clever that would take up less space which wouldn’t have all the pointer following. However, I couldn’t get it working.

I updated my branch with a POC patch to support SVE for rtld-audit [1].
In the end the layout I ended up using is:

  typedef union
  {
    float s;
    double d;
    long double q;
    long double *z;
  } La_aarch64_vector;

  /* Registers for entry into PLT on AArch64.  */
  typedef struct La_aarch64_regs
  {
    uint64_t          lr_xreg[9];
    La_aarch64_vector lr_vreg[8];
    uint64_t          lr_sp;
    uint64_t          lr_lr;
    uint8_t           lr_sve;
    uint16_t          *lr_sve_pregs[4];
  } La_aarch64_regs;

  /* Return values for calls from PLT on AArch64.  */
  typedef struct La_aarch64_retval
  {
    /* Up to eight integer registers can be used for a return value.  */
    uint64_t          lrv_xreg[8];
    /* Up to eight V registers can be used for a return value.  */
    La_aarch64_vector lrv_vreg[8];
    uint8_t           lrv_sve;
  } La_aarch64_retval;

It means the if 'lr_sve' is 0 in either La_aarch64_regs or La_aarch64_retval
the La_aarch64_vector contains the floating-pointer registers that can be
accessed directly.  Otherwise, 'La_aarch64_vector.z' points to a memory area
that holds up to 'lr_sve' bytes from the Z registers, which can be loaded
with svld1 intrinsic for instance (as tst-audit28.c does). The P register
follows the same logic, with each La_aarch64_regs.lr_sve_pregs pointing
to an area of memory 'lr_sve/8' in size.

So, to access the FP register as float you can use:

  static inline float regs_vec_to_float (const La_aarch64_regs *regs, int i)
  {
    float r;
    if (regs->lr_sve == 0)
      r = regs->lr_vreg[i].s;
    else
      memcpy (&r, &regs->lr_vreg[i].z[0], sizeof (r));
    return r;
  }

To implement it I had to setup lazy binding when profiling or auditing is
used, even when STO_AARCH64_VARIANT_PCS is being set.  Also, to not incur
in performance penalties on default non-SVE configuration, the patch uses
a new PTL entrypoint, _dl_runtime_profile_sve, which is used iff 'hwcap'
has HWCAP_SVE bit set.

I think this is a fair assumption since SVE has a defined set of registers
for argument passing and return values.  A new ABI with either different
argument passing or different registers would require a different PLT
entry, but I assume this would require another symbol flag anyway (or
at least a different ELF mark to indicate so).

For this POC, the profile '_dl_runtime_profile_sve' entrypoint assumes
the largest SVE register size possible (2048 bits) and thus it requires
a quite large stack (8976 bytes).  I think it would be possible make the
stack requirement dynamic depending of the vector length, but it would
make the PLT audit function way more complex.

This patch is not complete yet: the tst-audit28 does not check if compiler
supports SVE (we would need a configure check to disable for such case),
I need to add a proper comment for the _dl_runtime_profile_sve
stack layout, the test need to check for the P register state clobbering.

I also haven't check the performance penalties with this approach, and
maybe the way I am saving/restoring the SVE register might be optimized.

In any case, I checked on a SVE machine and at least the testcase work
as expected without any regressions.  I also did a sniff test on a non SVE
machine.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ld-audit-fixes

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-08-04 18:11                 ` Adhemerval Zanella via Libc-alpha
@ 2021-08-05 10:32                   ` Szabolcs Nagy via Libc-alpha
  2021-08-05 19:36                     ` Ben Woodard via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-08-05 10:32 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha

The 08/04/2021 15:11, Adhemerval Zanella wrote:
> I updated my branch with a POC patch to support SVE for rtld-audit [1].
> In the end the layout I ended up using is:
> 
>   typedef union
>   {
>     float s;
>     double d;
>     long double q;
>     long double *z;
>   } La_aarch64_vector;
> 
>   /* Registers for entry into PLT on AArch64.  */
>   typedef struct La_aarch64_regs
>   {
>     uint64_t          lr_xreg[9];
>     La_aarch64_vector lr_vreg[8];
>     uint64_t          lr_sp;
>     uint64_t          lr_lr;
>     uint8_t           lr_sve;
>     uint16_t          *lr_sve_pregs[4];
>   } La_aarch64_regs;
> 
>   /* Return values for calls from PLT on AArch64.  */
>   typedef struct La_aarch64_retval
>   {
>     /* Up to eight integer registers can be used for a return value.  */
>     uint64_t          lrv_xreg[8];
>     /* Up to eight V registers can be used for a return value.  */
>     La_aarch64_vector lrv_vreg[8];
>     uint8_t           lrv_sve;
>   } La_aarch64_retval;
> 
> It means the if 'lr_sve' is 0 in either La_aarch64_regs or La_aarch64_retval
> the La_aarch64_vector contains the floating-pointer registers that can be
> accessed directly.  Otherwise, 'La_aarch64_vector.z' points to a memory area
> that holds up to 'lr_sve' bytes from the Z registers, which can be loaded
> with svld1 intrinsic for instance (as tst-audit28.c does). The P register
> follows the same logic, with each La_aarch64_regs.lr_sve_pregs pointing
> to an area of memory 'lr_sve/8' in size.

i'd try a more generic extension mechanism like in
the linux sigcontext struct. then it's less likely
that existing plt hook code needs to change when
new register state is present and used.

and i think we need to handle variant pcs in a
generic way: we don't know if that's sve pcs or not.

for example

struct {
  uint64_t lr_x[9];
  uint64_t lr_lr;
  uint64_t lr_sp;
  uint64_t lr_flags; // e.g. is this a variant PCS call?
  vreg_t lr_v[8];
  struct extension *lr_ext;
};

struct extension {
  struct extension *next;
  uint32_t type; // e.g. sve extension
  uint32_t len; // can copy the contents even for unknown type
};

struct xreg_extension {
  struct extension header;
  uint64_t x[30];
};

struct vreg_extension {
  struct extension header;
  vreg_t v[24];
};

struct sve_extension {
  struct extension header;
  uint16_t vl;
  zreg_t *z[32];
  preg_t *p[16];
  char data[];
};

> 
> So, to access the FP register as float you can use:
> 
>   static inline float regs_vec_to_float (const La_aarch64_regs *regs, int i)
>   {
>     float r;
>     if (regs->lr_sve == 0)
>       r = regs->lr_vreg[i].s;
>     else
>       memcpy (&r, &regs->lr_vreg[i].z[0], sizeof (r));
>     return r;
>   }
> 
> To implement it I had to setup lazy binding when profiling or auditing is
> used, even when STO_AARCH64_VARIANT_PCS is being set.  Also, to not incur
> in performance penalties on default non-SVE configuration, the patch uses
> a new PTL entrypoint, _dl_runtime_profile_sve, which is used iff 'hwcap'
> has HWCAP_SVE bit set.

variant pcs does not mean 'sve call' it means 'arbitrary pcs'.
so we have to save all registers.

and a base pcs call does not have to preserve sve state so
we don't need to save the z regs even if sve is present.

the main difficulty i see is that we cannot easily tell in
a plt entry if it is for a variant pcs symbol: you have to
look at the symbol table entry using the symbol index from
the relocation. usually such code is in c, but c code does
not preserve all registers, so here it has to be in asm.
the clean way would be a different entrypoint for variant
pcs calls, but that requires linker changes (another PLT0
like construct where variant pcs PLT can go).

alternatively use _dl_runtime_profile_vpcs entrypoint for
an elf module that has DT_AARCH64_VARIANT_PCS set and then
always save all registers present. then the default entry
point does not need to deal with extensions. this may be
slow for some hpc usecases.

> 
> I think this is a fair assumption since SVE has a defined set of registers
> for argument passing and return values.  A new ABI with either different
> argument passing or different registers would require a different PLT
> entry, but I assume this would require another symbol flag anyway (or
> at least a different ELF mark to indicate so).
> 
> For this POC, the profile '_dl_runtime_profile_sve' entrypoint assumes
> the largest SVE register size possible (2048 bits) and thus it requires
> a quite large stack (8976 bytes).  I think it would be possible make the
> stack requirement dynamic depending of the vector length, but it would
> make the PLT audit function way more complex.

yeah, i think we need to understand how the plt hooks are
used: do they actually look at these registers? or they only
need the registers to be preserved? we may not need easy
access to the z reg contents.

> 
> This patch is not complete yet: the tst-audit28 does not check if compiler
> supports SVE (we would need a configure check to disable for such case),
> I need to add a proper comment for the _dl_runtime_profile_sve
> stack layout, the test need to check for the P register state clobbering.
> 
> I also haven't check the performance penalties with this approach, and
> maybe the way I am saving/restoring the SVE register might be optimized.
> 
> In any case, I checked on a SVE machine and at least the testcase work
> as expected without any regressions.  I also did a sniff test on a non SVE
> machine.
> 
> [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ld-audit-fixes

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-08-05 10:32                   ` Szabolcs Nagy via Libc-alpha
@ 2021-08-05 19:36                     ` Ben Woodard via Libc-alpha
  2021-08-06  9:04                       ` Szabolcs Nagy via Libc-alpha
  0 siblings, 1 reply; 23+ messages in thread
From: Ben Woodard via Libc-alpha @ 2021-08-05 19:36 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha



> On Aug 5, 2021, at 3:32 AM, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> 
> The 08/04/2021 15:11, Adhemerval Zanella wrote:
>> I updated my branch with a POC patch to support SVE for rtld-audit [1].
>> In the end the layout I ended up using is:
>> 
>>  typedef union
>>  {
>>    float s;
>>    double d;
>>    long double q;
>>    long double *z;
>>  } La_aarch64_vector;
>> 
>>  /* Registers for entry into PLT on AArch64.  */
>>  typedef struct La_aarch64_regs
>>  {
>>    uint64_t          lr_xreg[9];
>>    La_aarch64_vector lr_vreg[8];
>>    uint64_t          lr_sp;
>>    uint64_t          lr_lr;
>>    uint8_t           lr_sve;
>>    uint16_t          *lr_sve_pregs[4];
>>  } La_aarch64_regs;
>> 
>>  /* Return values for calls from PLT on AArch64.  */
>>  typedef struct La_aarch64_retval
>>  {
>>    /* Up to eight integer registers can be used for a return value.  */
>>    uint64_t          lrv_xreg[8];
>>    /* Up to eight V registers can be used for a return value.  */
>>    La_aarch64_vector lrv_vreg[8];
>>    uint8_t           lrv_sve;
>>  } La_aarch64_retval;
>> 
>> It means the if 'lr_sve' is 0 in either La_aarch64_regs or La_aarch64_retval
>> the La_aarch64_vector contains the floating-pointer registers that can be
>> accessed directly.  Otherwise, 'La_aarch64_vector.z' points to a memory area
>> that holds up to 'lr_sve' bytes from the Z registers, which can be loaded
>> with svld1 intrinsic for instance (as tst-audit28.c does). The P register
>> follows the same logic, with each La_aarch64_regs.lr_sve_pregs pointing
>> to an area of memory 'lr_sve/8' in size.
> 
> i'd try a more generic extension mechanism like in
> the linux sigcontext struct. then it's less likely
> that existing plt hook code needs to change when
> new register state is present and used.
> 
> and i think we need to handle variant pcs in a
> generic way: we don't know if that's sve pcs or not.
> 
> for example
> 
> struct {
>  uint64_t lr_x[9];
>  uint64_t lr_lr;
>  uint64_t lr_sp;
>  uint64_t lr_flags; // e.g. is this a variant PCS call?
>  vreg_t lr_v[8];
>  struct extension *lr_ext;
> };

What do you think about union’ing the lr_v and the lr_ext? I really did like that about Adhemeral’s approach.
> 
> struct extension {
>  struct extension *next;
>  uint32_t type; // e.g. sve extension
>  uint32_t len; // can copy the contents even for unknown type
> };
> 
> struct xreg_extension {
>  struct extension header;
>  uint64_t x[30];
> };
> 
> struct vreg_extension {
>  struct extension header;
>  vreg_t v[24];
> };
> 
> struct sve_extension {
>  struct extension header;
>  uint16_t vl;
>  zreg_t *z[32];
>  preg_t *p[16];
>  char data[];
> };

I like your idea of defining an extensible structure. Casting pointers is going to make some compilers complain but when you are doing system programming doing type polymorphism this way is kind of normal I guess.
> 
>> 
>> So, to access the FP register as float you can use:
>> 
>>  static inline float regs_vec_to_float (const La_aarch64_regs *regs, int i)
>>  {
>>    float r;
>>    if (regs->lr_sve == 0)
>>      r = regs->lr_vreg[i].s;
>>    else
>>      memcpy (&r, &regs->lr_vreg[i].z[0], sizeof (r));
>>    return r;
>>  }
>> 
>> To implement it I had to setup lazy binding when profiling or auditing is
>> used, even when STO_AARCH64_VARIANT_PCS is being set.  Also, to not incur
>> in performance penalties on default non-SVE configuration, the patch uses
>> a new PTL entrypoint, _dl_runtime_profile_sve, which is used iff 'hwcap'
>> has HWCAP_SVE bit set.
> 
> variant pcs does not mean 'sve call' it means 'arbitrary pcs'.
> so we have to save all registers.
> 
> and a base pcs call does not have to preserve sve state so
> we don't need to save the z regs even if sve is present.
> 

Yeah honestly this is why I really believe that we need to define STO_AARCH64_VARIANT_SVE (or something like that) and go through the pain of changing the compilers and binutils. I do not believe that my users really will care about the full breadth of STO_AARCH64_VARIANT_PCS and all the potential strange ABI variants where all the registers must be saved. I also know that they will be willing to recompile any code which uses inter-object calls that pass parameters using SVE registers in the case where they want or need to use performance tools which will go through the audit interface.

SVE parameter passing is defined in the AAPCS https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#parameter-passing-rules <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#parameter-passing-rules> while all the other PCS variants don’t obey the AAPCS. I would argue that using STO_AARCH64_VARIANT_PCS for inter-object SVE calls was a expedient kludge that needs to be backed out.

Having a design that can ultimately accommodate all the variants including ones that do not obey the AAPCS is great. However, the only problem we need to solve in the reasonably near future is being able to audit SVE calls. If and when there are changes to the AAPCS, having an extensible design such as yours, we can handle those too. (For example: I cannot tell yet if SME https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/scalable-matrix-extension-armv9-a-architecture will ultimately lead to changes to the AAPCS but if so, the changes do not appear to have landed yet.)

> the main difficulty i see is that we cannot easily tell in
> a plt entry if it is for a variant pcs symbol: you have to
> look at the symbol table entry using the symbol index from
> the relocation. usually such code is in c, but c code does
> not preserve all registers, so here it has to be in asm.
> the clean way would be a different entrypoint for variant
> pcs calls, but that requires linker changes (another PLT0
> like construct where variant pcs PLT can go).
> 
> alternatively use _dl_runtime_profile_vpcs entrypoint for
> an elf module that has DT_AARCH64_VARIANT_PCS set and then
> always save all registers present. then the default entry
> point does not need to deal with extensions. this may be
> slow for some hpc usecases.

uuugh — my users are all HPC users.
Really, it comes down to I’m not a fan of DT_AARCH64_VARIANT_PCS for SVE inter-object procedure calls. I would sort of like to leave it unauditable and just move SVE into its own variant. However, other than SVE I do not know of any other uses of DT_AARCH64_VARIANT_PCS. Writing assembly where all the registers must be preserved is such a pain.

> 
>> 
>> I think this is a fair assumption since SVE has a defined set of registers
>> for argument passing and return values.  A new ABI with either different
>> argument passing or different registers would require a different PLT
>> entry, but I assume this would require another symbol flag anyway (or
>> at least a different ELF mark to indicate so).
>> 
>> For this POC, the profile '_dl_runtime_profile_sve' entrypoint assumes
>> the largest SVE register size possible (2048 bits) and thus it requires
>> a quite large stack (8976 bytes).  I think it would be possible make the
>> stack requirement dynamic depending of the vector length, but it would
>> make the PLT audit function way more complex.
> 
> yeah, i think we need to understand how the plt hooks are
> used: do they actually look at these registers? or they only
> need the registers to be preserved? we may not need easy
> access to the z reg contents.
> 

Tools that actually use the PLT hooks are very rare — even more rare than tools that use LD_AUDIT. The ones that I have seen use that interface to inspect and modify parameters and return values. So I would argue that providing RW access to the Z registers is required. 

The two broad categories that I have seen were either trivial and used for basic curiosity and debugging (kind of like a targeted ltrace) and ones that worked around bugs in the libraries or did additional security checks not implemented in the library. The latter seems to be what the PLT portion of the LD_AUDIT interface was designed to do. Quickly coding up security bandaids to detect and prevent exploitation until the library could be properly fixed, has been really handy a couple of times. For example, I wrote one a few years ago that detected prevented a buffer overflow in library where fixing the underlying problem in the library required an API change that would have required broader refactoring of the application. (IIRC that was libpng or some graphics format handling library - the effectiveness of this solution was limited by the fact that glibc and binutils didn’t yet implement DEPAUDIT.)


>> 
>> This patch is not complete yet: the tst-audit28 does not check if compiler
>> supports SVE (we would need a configure check to disable for such case),
>> I need to add a proper comment for the _dl_runtime_profile_sve
>> stack layout, the test need to check for the P register state clobbering.
>> 
>> I also haven't check the performance penalties with this approach, and
>> maybe the way I am saving/restoring the SVE register might be optimized.
>> 
>> In any case, I checked on a SVE machine and at least the testcase work
>> as expected without any regressions.  I also did a sniff test on a non SVE
>> machine.
>> 
>> [1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/ld-audit-fixes


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list)
  2021-08-05 19:36                     ` Ben Woodard via Libc-alpha
@ 2021-08-06  9:04                       ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 23+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-08-06  9:04 UTC (permalink / raw)
  To: Ben Woodard; +Cc: Florian Weimer, John Mellor-Crummey, libc-alpha

The 08/05/2021 12:36, Ben Woodard wrote:
> > On Aug 5, 2021, at 3:32 AM, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > variant pcs does not mean 'sve call' it means 'arbitrary pcs'.
> > so we have to save all registers.
> > 
> > and a base pcs call does not have to preserve sve state so
> > we don't need to save the z regs even if sve is present.
> > 
> 
> Yeah honestly this is why I really believe that we need to define STO_AARCH64_VARIANT_SVE (or something like that) and go through the pain of changing the compilers and binutils. I do not believe that my users really will care about the full breadth of STO_AARCH64_VARIANT_PCS and all the potential strange ABI variants where all the registers must be saved. I also know that they will be willing to recompile any code which uses inter-object calls that pass parameters using SVE registers in the case where they want or need to use performance tools which will go through the audit interface.
> 
> SVE parameter passing is defined in the AAPCS https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#parameter-passing-rules <https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#parameter-passing-rules> while all the other PCS variants don’t obey the AAPCS. I would argue that using STO_AARCH64_VARIANT_PCS for inter-object SVE calls was a expedient kludge that needs to be backed out.
> 
> Having a design that can ultimately accommodate all the variants including ones that do not obey the AAPCS is great. However, the only problem we need to solve in the reasonably near future is being able to audit SVE calls. If and when there are changes to the AAPCS, having an extensible design such as yours, we can handle those too. (For example: I cannot tell yet if SME https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/scalable-matrix-extension-armv9-a-architecture will ultimately lead to changes to the AAPCS but if so, the changes do not appear to have landed yet.)

we thought about marking the exact pcs separately, but

1) we did not want to reserve the available STO_ bits
   for pcs variants (there are only 6 bits and many
   potential future extensions and pcs variants)

2) we did not think that plt hooks will want to know
   the pcs in more detail than just marking non-base
   calls. plt hooks seemed to me unreliable and very
   rarely used.

if we add STO_AARCH64_SVE_PCS then we need
_ADVSIMD_VECTOR_PCS too and possibly SME variants soon.
and each variant requires their own PLT0 entry: the
STO_ flag can be checked at load time but not easily
at plt entry time, so separate plt entries are needed.
and each new variant requires a glibc update. this
approach was on the table, but i considered it overkill
when we had no usecases.

> > alternatively use _dl_runtime_profile_vpcs entrypoint for
> > an elf module that has DT_AARCH64_VARIANT_PCS set and then
> > always save all registers present. then the default entry
> > point does not need to deal with extensions. this may be
> > slow for some hpc usecases.
> 
> uuugh — my users are all HPC users.

i think we need to do some measurements how much overhead
it is to save/restore all regs for all calls.

and estimate how common it is to have elf modules with
sve calls in a hpc setting.

> Really, it comes down to I’m not a fan of DT_AARCH64_VARIANT_PCS for SVE inter-object procedure calls. I would sort of like to leave it unauditable and just move SVE into its own variant. However, other than SVE I do not know of any other uses of DT_AARCH64_VARIANT_PCS. Writing assembly where all the registers must be preserved is such a pain.

i only want to do the SVE marking if we have no solution
with existing elf abi.

> > yeah, i think we need to understand how the plt hooks are
> > used: do they actually look at these registers? or they only
> > need the registers to be preserved? we may not need easy
> > access to the z reg contents.
> > 
> 
> Tools that actually use the PLT hooks are very rare — even more rare than tools that use LD_AUDIT. The ones that I have seen use that interface to inspect and modify parameters and return values. So I would argue that providing RW access to the Z registers is required. 
> 
> The two broad categories that I have seen were either trivial and used for basic curiosity and debugging (kind of like a targeted ltrace) and ones that worked around bugs in the libraries or did additional security checks not implemented in the library. The latter seems to be what the PLT portion of the LD_AUDIT interface was designed to do. Quickly coding up security bandaids to detect and prevent exploitation until the library could be properly fixed, has been really handy a couple of times. For example, I wrote one a few years ago that detected prevented a buffer overflow in library where fixing the underlying problem in the library required an API change that would have required broader refactoring of the application. (IIRC that was libpng or some graphics format handling library - the effectiveness of this solution was limited by the fact that glibc and binutils didn’t yet implement DEPAUDIT.)

i see, i think these can work with variant pcs,
the main question is the overhead.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-08-06  9:05 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-16 17:55 A collection of LD_AUDIT bugs that are important for tools (with better formatting for this list) John Mellor-Crummey via Libc-alpha
2021-06-17 19:42 ` Adhemerval Zanella via Libc-alpha
2021-06-17 20:09   ` Florian Weimer via Libc-alpha
2021-06-17 23:06     ` Adhemerval Zanella via Libc-alpha
2021-06-23 17:42       ` Ben Woodard via Libc-alpha
2021-07-30 14:58         ` Adhemerval Zanella via Libc-alpha
2021-07-30 18:59           ` Ben Woodard via Libc-alpha
2021-07-30 21:09             ` Adhemerval Zanella via Libc-alpha
2021-07-31  0:59               ` Ben Woodard via Libc-alpha
2021-08-04 18:11                 ` Adhemerval Zanella via Libc-alpha
2021-08-05 10:32                   ` Szabolcs Nagy via Libc-alpha
2021-08-05 19:36                     ` Ben Woodard via Libc-alpha
2021-08-06  9:04                       ` Szabolcs Nagy via Libc-alpha
2021-06-21 19:42     ` John Mellor-Crummey via Libc-alpha
2021-06-22  8:15       ` Florian Weimer via Libc-alpha
2021-06-22 15:04         ` John Mellor-Crummey via Libc-alpha
2021-06-22 15:36           ` Florian Weimer via Libc-alpha
2021-06-22 16:17             ` John Mellor-Crummey via Libc-alpha
2021-06-22 16:33               ` Adhemerval Zanella via Libc-alpha
2021-06-23  6:32                 ` Florian Weimer via Libc-alpha
2021-06-23 20:06                   ` Mark Krentel via Libc-alpha
2021-06-18 17:48   ` John Mellor-Crummey via Libc-alpha
2021-06-18 18:27     ` Adhemerval Zanella via Libc-alpha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).