RFC: Disable clone3 for glibc 2.34

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* RFC: Disable clone3 for glibc 2.34
@ 2021-07-27  8:43 Florian Weimer via Libc-alpha
  2021-07-27  9:11 ` Florian Weimer via Libc-alpha
  2021-07-27 23:07 ` Andreas K. Huettel via Libc-alpha
  0 siblings, 2 replies; 17+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-07-27  8:43 UTC (permalink / raw)
  To: libc-alpha

Reportedly, the docker package in Ubuntu as used by Github Actions and
others does not provide a way to enable the clone3 system call.  It
always fails with EPERM.

Should we apply a patch like this for the release?

diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
index 1e7a8f6b35..4046c81180 100644
--- a/sysdeps/unix/sysv/linux/clone-internal.c
+++ b/sysdeps/unix/sysv/linux/clone-internal.c
@@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
 		  int (*func) (void *arg), void *arg)
 {
   int ret;
-#ifdef HAVE_CLONE3_WAPPER
-  /* Try clone3 first.  */
-  int saved_errno = errno;
-  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
-  if (ret != -1 || errno != ENOSYS)
-    return ret;
-
-  /* NB: Restore errno since errno may be checked against non-zero
-     return value.  */
-  __set_errno (saved_errno);
-#endif
 
   /* Map clone3 arguments to clone arguments.  NB: No need to check
      invalid clone3 specific bits in flags nor exit_signal since this

My concern with this is that we don't know yet where the CET kernel API
will land exactly and if CET will require clone3.  So clone3 might have
to come back once we turn on CET, which is hopefully soon.

Thanks,
Florian


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27  8:43 RFC: Disable clone3 for glibc 2.34 Florian Weimer via Libc-alpha
@ 2021-07-27  9:11 ` Florian Weimer via Libc-alpha
  2021-07-27  9:24   ` Christian Brauner
  2021-07-27 23:07 ` Andreas K. Huettel via Libc-alpha
  1 sibling, 1 reply; 17+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-07-27  9:11 UTC (permalink / raw)
  To: Florian Weimer via Libc-alpha

* Florian Weimer via Libc-alpha:

> Reportedly, the docker package in Ubuntu as used by Github Actions and
> others does not provide a way to enable the clone3 system call.  It
> always fails with EPERM.
>
> Should we apply a patch like this for the release?
>
> diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> index 1e7a8f6b35..4046c81180 100644
> --- a/sysdeps/unix/sysv/linux/clone-internal.c
> +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
>  		  int (*func) (void *arg), void *arg)
>  {
>    int ret;
> -#ifdef HAVE_CLONE3_WAPPER
> -  /* Try clone3 first.  */
> -  int saved_errno = errno;
> -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> -  if (ret != -1 || errno != ENOSYS)
> -    return ret;
> -
> -  /* NB: Restore errno since errno may be checked against non-zero
> -     return value.  */
> -  __set_errno (saved_errno);
> -#endif
>  
>    /* Map clone3 arguments to clone arguments.  NB: No need to check
>       invalid clone3 specific bits in flags nor exit_signal since this
>
> My concern with this is that we don't know yet where the CET kernel API
> will land exactly and if CET will require clone3.  So clone3 might have
> to come back once we turn on CET, which is hopefully soon.

Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
issue with its docker.io/containerd/runc packages.

I could trivially fix a previously failing Github Action with:

diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
index d2381ec..7b10286 100644
--- a/.github/workflows/fedora.yml
+++ b/.github/workflows/fedora.yml
@@ -22,6 +22,7 @@ jobs:
     runs-on: ubuntu-latest
     container:
       image: fedora:${{matrix.release}}
+      options: --security-opt seccomp=unconfined
 
     steps:
       - name: Checkout repository

So I think we need to figure out what people are actually complaining
about.

Thanks,
Florian


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27  9:11 ` Florian Weimer via Libc-alpha
@ 2021-07-27  9:24   ` Christian Brauner
  2021-07-27  9:41     ` Christian Brauner
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-07-27  9:24 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Florian Weimer via Libc-alpha

On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote:
> * Florian Weimer via Libc-alpha:
> 
> > Reportedly, the docker package in Ubuntu as used by Github Actions and
> > others does not provide a way to enable the clone3 system call.  It
> > always fails with EPERM.
> >
> > Should we apply a patch like this for the release?
> >
> > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> > index 1e7a8f6b35..4046c81180 100644
> > --- a/sysdeps/unix/sysv/linux/clone-internal.c
> > +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
> >  		  int (*func) (void *arg), void *arg)
> >  {
> >    int ret;
> > -#ifdef HAVE_CLONE3_WAPPER
> > -  /* Try clone3 first.  */
> > -  int saved_errno = errno;
> > -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > -  if (ret != -1 || errno != ENOSYS)
> > -    return ret;
> > -
> > -  /* NB: Restore errno since errno may be checked against non-zero
> > -     return value.  */
> > -  __set_errno (saved_errno);
> > -#endif
> >  
> >    /* Map clone3 arguments to clone arguments.  NB: No need to check
> >       invalid clone3 specific bits in flags nor exit_signal since this
> >
> > My concern with this is that we don't know yet where the CET kernel API
> > will land exactly and if CET will require clone3.  So clone3 might have
> > to come back once we turn on CET, which is hopefully soon.
> 
> Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
> issue with its docker.io/containerd/runc packages.
> 
> I could trivially fix a previously failing Github Action with:
> 
> diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
> index d2381ec..7b10286 100644
> --- a/.github/workflows/fedora.yml
> +++ b/.github/workflows/fedora.yml
> @@ -22,6 +22,7 @@ jobs:
>      runs-on: ubuntu-latest
>      container:
>        image: fedora:${{matrix.release}}
> +      options: --security-opt seccomp=unconfined
>  
>      steps:
>        - name: Checkout repository
> 
> So I think we need to figure out what people are actually complaining
> about.

This relates to the discussion what errno value should be used in a
seccomp filter to indicate that a syscall is blocked.

So there are two problems I see with seccomp and clone3():
1. the profile doesn't include clone3() at all and therefore the syscall
   is blocked and the default action is EPERM
2. the profile does include clone3() and decided to block it but the
   runtime has decided to make seccomp return EPERM and not ENOSYS when
   clone3() is attempted

The correct fix in both scenarios is to add clone3() to the seccomp
profile and either allow it or return ENOSYS.

Note that this ENOSYS/EPERM problem is a general problem. Not just glibc
doesn't know when to fallback gracefully other tools don't know either.
Application container usually just get lucky because their applications
don't need to issue the syscalls that are blocked. On a generic system
container with systemd inside this is always an issue and not using
ENOSYS is guaranteed to fail across the board.

Christian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27  9:24   ` Christian Brauner
@ 2021-07-27  9:41     ` Christian Brauner
  2021-07-27 10:22       ` Aleksa Sarai
  0 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2021-07-27  9:41 UTC (permalink / raw)
  To: Aleksa Sarai; +Cc: Florian Weimer, Florian Weimer via Libc-alpha

On Tue, Jul 27, 2021 at 11:24:16AM +0200, Christian Brauner wrote:
> On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote:
> > * Florian Weimer via Libc-alpha:
> > 
> > > Reportedly, the docker package in Ubuntu as used by Github Actions and
> > > others does not provide a way to enable the clone3 system call.  It
> > > always fails with EPERM.
> > >
> > > Should we apply a patch like this for the release?
> > >
> > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> > > index 1e7a8f6b35..4046c81180 100644
> > > --- a/sysdeps/unix/sysv/linux/clone-internal.c
> > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
> > >  		  int (*func) (void *arg), void *arg)
> > >  {
> > >    int ret;
> > > -#ifdef HAVE_CLONE3_WAPPER
> > > -  /* Try clone3 first.  */
> > > -  int saved_errno = errno;
> > > -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > > -  if (ret != -1 || errno != ENOSYS)
> > > -    return ret;
> > > -
> > > -  /* NB: Restore errno since errno may be checked against non-zero
> > > -     return value.  */
> > > -  __set_errno (saved_errno);
> > > -#endif
> > >  
> > >    /* Map clone3 arguments to clone arguments.  NB: No need to check
> > >       invalid clone3 specific bits in flags nor exit_signal since this
> > >
> > > My concern with this is that we don't know yet where the CET kernel API
> > > will land exactly and if CET will require clone3.  So clone3 might have
> > > to come back once we turn on CET, which is hopefully soon.
> > 
> > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
> > issue with its docker.io/containerd/runc packages.
> > 
> > I could trivially fix a previously failing Github Action with:
> > 
> > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
> > index d2381ec..7b10286 100644
> > --- a/.github/workflows/fedora.yml
> > +++ b/.github/workflows/fedora.yml
> > @@ -22,6 +22,7 @@ jobs:
> >      runs-on: ubuntu-latest
> >      container:
> >        image: fedora:${{matrix.release}}
> > +      options: --security-opt seccomp=unconfined
> >  
> >      steps:
> >        - name: Checkout repository
> > 
> > So I think we need to figure out what people are actually complaining
> > about.
> 
> This relates to the discussion what errno value should be used in a
> seccomp filter to indicate that a syscall is blocked.
> 
> So there are two problems I see with seccomp and clone3():
> 1. the profile doesn't include clone3() at all and therefore the syscall
>    is blocked and the default action is EPERM
> 2. the profile does include clone3() and decided to block it but the
>    runtime has decided to make seccomp return EPERM and not ENOSYS when
>    clone3() is attempted
> 
> The correct fix in both scenarios is to add clone3() to the seccomp
> profile and either allow it or return ENOSYS.
> 
> Note that this ENOSYS/EPERM problem is a general problem. Not just glibc
> doesn't know when to fallback gracefully other tools don't know either.
> Application container usually just get lucky because their applications
> don't need to issue the syscalls that are blocked. On a generic system
> container with systemd inside this is always an issue and not using
> ENOSYS is guaranteed to fail across the board.

Aleksa, this is fixed in runC, right?

Christian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27  9:41     ` Christian Brauner
@ 2021-07-27 10:22       ` Aleksa Sarai
  2021-07-27 10:48         ` Szabolcs Nagy via Libc-alpha
  2021-07-28 17:44         ` Florian Weimer via Libc-alpha
  0 siblings, 2 replies; 17+ messages in thread
From: Aleksa Sarai @ 2021-07-27 10:22 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Florian Weimer, Florian Weimer via Libc-alpha

[-- Attachment #1: Type: text/plain, Size: 4638 bytes --]

On 2021-07-27, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Tue, Jul 27, 2021 at 11:24:16AM +0200, Christian Brauner wrote:
> > On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote:
> > > * Florian Weimer via Libc-alpha:
> > > 
> > > > Reportedly, the docker package in Ubuntu as used by Github Actions and
> > > > others does not provide a way to enable the clone3 system call.  It
> > > > always fails with EPERM.
> > > >
> > > > Should we apply a patch like this for the release?
> > > >
> > > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> > > > index 1e7a8f6b35..4046c81180 100644
> > > > --- a/sysdeps/unix/sysv/linux/clone-internal.c
> > > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> > > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
> > > >  		  int (*func) (void *arg), void *arg)
> > > >  {
> > > >    int ret;
> > > > -#ifdef HAVE_CLONE3_WAPPER
> > > > -  /* Try clone3 first.  */
> > > > -  int saved_errno = errno;
> > > > -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > > > -  if (ret != -1 || errno != ENOSYS)
> > > > -    return ret;
> > > > -
> > > > -  /* NB: Restore errno since errno may be checked against non-zero
> > > > -     return value.  */
> > > > -  __set_errno (saved_errno);
> > > > -#endif
> > > >  
> > > >    /* Map clone3 arguments to clone arguments.  NB: No need to check
> > > >       invalid clone3 specific bits in flags nor exit_signal since this
> > > >
> > > > My concern with this is that we don't know yet where the CET kernel API
> > > > will land exactly and if CET will require clone3.  So clone3 might have
> > > > to come back once we turn on CET, which is hopefully soon.
> > > 
> > > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
> > > issue with its docker.io/containerd/runc packages.
> > > 
> > > I could trivially fix a previously failing Github Action with:
> > > 
> > > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
> > > index d2381ec..7b10286 100644
> > > --- a/.github/workflows/fedora.yml
> > > +++ b/.github/workflows/fedora.yml
> > > @@ -22,6 +22,7 @@ jobs:
> > >      runs-on: ubuntu-latest
> > >      container:
> > >        image: fedora:${{matrix.release}}
> > > +      options: --security-opt seccomp=unconfined
> > >  
> > >      steps:
> > >        - name: Checkout repository
> > > 
> > > So I think we need to figure out what people are actually complaining
> > > about.
> > 
> > This relates to the discussion what errno value should be used in a
> > seccomp filter to indicate that a syscall is blocked.
> > 
> > So there are two problems I see with seccomp and clone3():
> > 1. the profile doesn't include clone3() at all and therefore the syscall
> >    is blocked and the default action is EPERM
> > 2. the profile does include clone3() and decided to block it but the
> >    runtime has decided to make seccomp return EPERM and not ENOSYS when
> >    clone3() is attempted
> > 
> > The correct fix in both scenarios is to add clone3() to the seccomp
> > profile and either allow it or return ENOSYS.
> > 
> > Note that this ENOSYS/EPERM problem is a general problem. Not just glibc
> > doesn't know when to fallback gracefully other tools don't know either.
> > Application container usually just get lucky because their applications
> > don't need to issue the syscalls that are blocked. On a generic system
> > container with systemd inside this is always an issue and not using
> > ENOSYS is guaranteed to fail across the board.
> 
> Aleksa, this is fixed in runC, right?

Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.

The way it works is that any syscall which has a larger syscall number
than any syscall specified in the filter will get -ENOSYS (this works
even if libseccomp is outdated). The only way you could get the -EPERM
behaviour with modern runc is if you write a seccomp profile that had
rules for newer syscalls (openat2 for instance) but not clone3 -- but
Docker doesn't do that. (The reason for this slightly convoluted
behaviour was to make sure that intentional omissions actually give you
-EPERM.)

However this requires the container host to have an updated version of
runc which is up to GitHub. (Though we fixed a security issue in runc
recently, so I would expect that they've updated their versions of runc
by now.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27 10:22       ` Aleksa Sarai
@ 2021-07-27 10:48         ` Szabolcs Nagy via Libc-alpha
  2021-07-29  8:56           ` Aleksa Sarai
  2021-07-28 17:44         ` Florian Weimer via Libc-alpha
  1 sibling, 1 reply; 17+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-07-27 10:48 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Florian Weimer, Christian Brauner, Florian Weimer via Libc-alpha

The 07/27/2021 20:22, Aleksa Sarai wrote:
> Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> 
> The way it works is that any syscall which has a larger syscall number
> than any syscall specified in the filter will get -ENOSYS (this works
> even if libseccomp is outdated). The only way you could get the -EPERM
> behaviour with modern runc is if you write a seccomp profile that had
> rules for newer syscalls (openat2 for instance) but not clone3 -- but
> Docker doesn't do that. (The reason for this slightly convoluted
> behaviour was to make sure that intentional omissions actually give you
> -EPERM.)

this sounds broken. it really should return ENOSYS unless
a user specifically asked for a different errno value for
a syscall. EPERM is just wrong.

we will see random breakage in the future depending on
what unrelated but newer syscalls users added to their
whitelist. who thought this was a good idea?

i can't believe this is still broken, this seccomp filter
bug was reported ages ago.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27  8:43 RFC: Disable clone3 for glibc 2.34 Florian Weimer via Libc-alpha
  2021-07-27  9:11 ` Florian Weimer via Libc-alpha
@ 2021-07-27 23:07 ` Andreas K. Huettel via Libc-alpha
  2021-07-28  4:58   ` Florian Weimer via Libc-alpha
  1 sibling, 1 reply; 17+ messages in thread
From: Andreas K. Huettel via Libc-alpha @ 2021-07-27 23:07 UTC (permalink / raw)
  To: libc-alpha; +Cc: Florian Weimer

[-- Attachment #1: Type: text/plain, Size: 351 bytes --]

Semi-related...

>    int ret;
> -#ifdef HAVE_CLONE3_WAPPER
> -  /* Try clone3 first.  */
> -  int saved_errno = errno;

shouldn't this be WRAPPER ?

(It seems to be consistent in the code, so no harm done so far...)

-- 
Andreas K. Hüttel
dilfridge@gentoo.org
Gentoo Linux developer
(council, toolchain, base-system, perl, libreoffice)

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 981 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27 23:07 ` Andreas K. Huettel via Libc-alpha
@ 2021-07-28  4:58   ` Florian Weimer via Libc-alpha
  2021-07-28 17:22     ` [PATCH] Typo: Rename HAVE_CLONE3_WAPPER to HAVE_CLONE3_WRAPPER H.J. Lu via Libc-alpha
  0 siblings, 1 reply; 17+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-07-28  4:58 UTC (permalink / raw)
  To: Andreas K. Huettel via Libc-alpha

* Andreas K. Huettel via Libc-alpha:

> Semi-related...
>
>>    int ret;
>> -#ifdef HAVE_CLONE3_WAPPER
>> -  /* Try clone3 first.  */
>> -  int saved_errno = errno;
>
> shouldn't this be WRAPPER ?
>
> (It seems to be consistent in the code, so no harm done so far...)

Yes, it should be.  And yes, it's currently consistent.  But it's
confusing if you search for it and don't make the typo.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] Typo: Rename HAVE_CLONE3_WAPPER to HAVE_CLONE3_WRAPPER
  2021-07-28  4:58   ` Florian Weimer via Libc-alpha
@ 2021-07-28 17:22     ` H.J. Lu via Libc-alpha
  2021-07-28 17:35       ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 17+ messages in thread
From: H.J. Lu via Libc-alpha @ 2021-07-28 17:22 UTC (permalink / raw)
  To: Florian Weimer, Carlos O'Donell; +Cc: Andreas K. Huettel via Libc-alpha

[-- Attachment #1: Type: text/plain, Size: 589 bytes --]

On Tue, Jul 27, 2021 at 9:59 PM Florian Weimer via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> * Andreas K. Huettel via Libc-alpha:
>
> > Semi-related...
> >
> >>    int ret;
> >> -#ifdef HAVE_CLONE3_WAPPER
> >> -  /* Try clone3 first.  */
> >> -  int saved_errno = errno;
> >
> > shouldn't this be WRAPPER ?
> >
> > (It seems to be consistent in the code, so no harm done so far...)
>
> Yes, it should be.  And yes, it's currently consistent.  But it's
> confusing if you search for it and don't make the typo.
>
> Thanks,
> Florian
>

Here is the patch.   OK for master?

-- 
H.J.

[-- Attachment #2: 0001-Typo-Rename-HAVE_CLONE3_WAPPER-to-HAVE_CLONE3_WRAPPE.patch --]
[-- Type: text/x-patch, Size: 1946 bytes --]

From c25c32165d8b7c506442fdc0304f7a3a223e1f42 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Wed, 28 Jul 2021 10:19:08 -0700
Subject: [PATCH] Typo: Rename HAVE_CLONE3_WAPPER to HAVE_CLONE3_WRAPPER

---
 sysdeps/unix/sysv/linux/clone-internal.c | 2 +-
 sysdeps/unix/sysv/linux/i386/sysdep.h    | 2 +-
 sysdeps/unix/sysv/linux/x86_64/sysdep.h  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
index 1e7a8f6b35..979f7880be 100644
--- a/sysdeps/unix/sysv/linux/clone-internal.c
+++ b/sysdeps/unix/sysv/linux/clone-internal.c
@@ -48,7 +48,7 @@ __clone_internal (struct clone_args *cl_args,
 		  int (*func) (void *arg), void *arg)
 {
   int ret;
-#ifdef HAVE_CLONE3_WAPPER
+#ifdef HAVE_CLONE3_WRAPPER
   /* Try clone3 first.  */
   int saved_errno = errno;
   ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
diff --git a/sysdeps/unix/sysv/linux/i386/sysdep.h b/sysdeps/unix/sysv/linux/i386/sysdep.h
index 3927a1a6e0..8a9911b7ac 100644
--- a/sysdeps/unix/sysv/linux/i386/sysdep.h
+++ b/sysdeps/unix/sysv/linux/i386/sysdep.h
@@ -291,7 +291,7 @@ struct libc_do_syscall_args
 # define HAVE_TIME_VSYSCALL             "__vdso_time"
 # define HAVE_CLOCK_GETRES_VSYSCALL     "__vdso_clock_getres"
 
-# define HAVE_CLONE3_WAPPER		1
+# define HAVE_CLONE3_WRAPPER		1
 
 # undef HAVE_INTERNAL_BRK_ADDR_SYMBOL
 # define HAVE_INTERNAL_BRK_ADDR_SYMBOL 1
diff --git a/sysdeps/unix/sysv/linux/x86_64/sysdep.h b/sysdeps/unix/sysv/linux/x86_64/sysdep.h
index f26ffc68ae..327e59388b 100644
--- a/sysdeps/unix/sysv/linux/x86_64/sysdep.h
+++ b/sysdeps/unix/sysv/linux/x86_64/sysdep.h
@@ -377,7 +377,7 @@
 # define HAVE_GETCPU_VSYSCALL		"__vdso_getcpu"
 # define HAVE_CLOCK_GETRES64_VSYSCALL   "__vdso_clock_getres"
 
-# define HAVE_CLONE3_WAPPER			1
+# define HAVE_CLONE3_WRAPPER			1
 
 # define SINGLE_THREAD_BY_GLOBAL		1
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH] Typo: Rename HAVE_CLONE3_WAPPER to HAVE_CLONE3_WRAPPER
  2021-07-28 17:22     ` [PATCH] Typo: Rename HAVE_CLONE3_WAPPER to HAVE_CLONE3_WRAPPER H.J. Lu via Libc-alpha
@ 2021-07-28 17:35       ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 17+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-07-28 17:35 UTC (permalink / raw)
  To: H.J. Lu, Florian Weimer, Carlos O'Donell
  Cc: Andreas K. Huettel via Libc-alpha

On 28/07/2021 14:22, H.J. Lu via Libc-alpha wrote:
> On Tue, Jul 27, 2021 at 9:59 PM Florian Weimer via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> * Andreas K. Huettel via Libc-alpha:
>>
>>> Semi-related...
>>>
>>>>    int ret;
>>>> -#ifdef HAVE_CLONE3_WAPPER
>>>> -  /* Try clone3 first.  */
>>>> -  int saved_errno = errno;
>>>
>>> shouldn't this be WRAPPER ?
>>>
>>> (It seems to be consistent in the code, so no harm done so far...)
>>
>> Yes, it should be.  And yes, it's currently consistent.  But it's
>> confusing if you search for it and don't make the typo.
>>
>> Thanks,
>> Florian
>>
> 
> Here is the patch.   OK for master?
> 

LGTM, thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27 10:22       ` Aleksa Sarai
  2021-07-27 10:48         ` Szabolcs Nagy via Libc-alpha
@ 2021-07-28 17:44         ` Florian Weimer via Libc-alpha
  2021-07-29  8:36           ` Daniel P. Berrangé via Libc-alpha
  1 sibling, 1 reply; 17+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-07-28 17:44 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Daniel P. Berrangé, Christian Brauner,
	Florian Weimer via Libc-alpha

* Aleksa Sarai:

> Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
>
> The way it works is that any syscall which has a larger syscall number
> than any syscall specified in the filter will get -ENOSYS (this works
> even if libseccomp is outdated). The only way you could get the -EPERM
> behaviour with modern runc is if you write a seccomp profile that had
> rules for newer syscalls (openat2 for instance) but not clone3 -- but
> Docker doesn't do that. (The reason for this slightly convoluted
> behaviour was to make sure that intentional omissions actually give you
> -EPERM.)
>
> However this requires the container host to have an updated version of
> runc which is up to GitHub. (Though we fixed a security issue in runc
> recently, so I would expect that they've updated their versions of runc
> by now.)

Indeed I wasn't able to reproduce this locally.  Ubuntu's docker.io
package behaves as expected, even for “docker build” as far as I can
see.

So far, the reported breakage has been focused on Github Actions and
Azure Devops.  They use a custom Docker-Moby build, and I don't know
what's in it.  The net effect is that clone3 does not work in containers
by default.  “docker build” still does not allow “--security-opt
seccomp=unconfined” for unknown reasons, but that workaround still
applies to “docker create”.

Daniel P. Berrangé reported that Moby mentions a system call in its
policy whose number is larger than clone3, effectively turning ENOSYS
into ENOPERM for clone3.  Looking at the recent change, it could be the
addition of close_range and epoll_pwait2 in this commit:

commit 54eff4354b17a9c460b851300f28aed1408a8615
Author: Aleksa Sarai <asarai@suse.de>
Date:   Sun Jan 17 23:39:31 2021 +1100

    profiles: seccomp: update to Linux 5.11 syscall list

    These syscalls (some of which have been in Linux for a while but were
    missing from the profile) fall into a few buckets:

     * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
       for everyone" syscalls.

     * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
       all equivalent to aspects of mount(2) and thus go into the
       CAP_SYS_ADMIN category.

     * process_madvise(2) is similar to the other process_*(2) syscalls and
       thus goes in the CAP_SYS_PTRACE category.

    Signed-off-by: Aleksa Sarai <asarai@suse.de>

Maybe we don't see this everywhere because these higher system call
numbers become available only if the system libseccomp version is recent
enough to know about them.  Once that is the case, the ENOSYS/EPERM line
shifts and clone3 is on the wrong side of it.

If that's indeed the explanation, then maybe we can simply fix moby and
ask Microsoft to respin their images?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-28 17:44         ` Florian Weimer via Libc-alpha
@ 2021-07-29  8:36           ` Daniel P. Berrangé via Libc-alpha
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel P. Berrangé via Libc-alpha @ 2021-07-29  8:36 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Aleksa Sarai, Florian Weimer via Libc-alpha, Christian Brauner

On Wed, Jul 28, 2021 at 07:44:03PM +0200, Florian Weimer wrote:
> * Aleksa Sarai:
> 
> > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> >
> > The way it works is that any syscall which has a larger syscall number
> > than any syscall specified in the filter will get -ENOSYS (this works
> > even if libseccomp is outdated). The only way you could get the -EPERM
> > behaviour with modern runc is if you write a seccomp profile that had
> > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > Docker doesn't do that. (The reason for this slightly convoluted
> > behaviour was to make sure that intentional omissions actually give you
> > -EPERM.)
> >
> > However this requires the container host to have an updated version of
> > runc which is up to GitHub. (Though we fixed a security issue in runc
> > recently, so I would expect that they've updated their versions of runc
> > by now.)
> 
> Indeed I wasn't able to reproduce this locally.  Ubuntu's docker.io
> package behaves as expected, even for “docker build” as far as I can
> see.
> 
> So far, the reported breakage has been focused on Github Actions and
> Azure Devops.  They use a custom Docker-Moby build, and I don't know
> what's in it.  The net effect is that clone3 does not work in containers
> by default.  “docker build” still does not allow “--security-opt
> seccomp=unconfined” for unknown reasons, but that workaround still
> applies to “docker create”.

FYI I found this issue describing the docker build + seccomp feature
gap. Seems it was intentional to not allow it to be used:

   https://github.com/moby/moby/issues/34454

To workaround this gap in "docker build", you can use the --seccomp-profile
option to dockerd daemon when starting it up, to pass a new profile that
applies by default to everything. Doesn't help if you're just using a
dockerd instance started/managed by someone/something else though.

> Daniel P. Berrangé reported that Moby mentions a system call in its
> policy whose number is larger than clone3, effectively turning ENOSYS
> into ENOPERM for clone3.  Looking at the recent change, it could be the
> addition of close_range and epoll_pwait2 in this commit:

I'm not 100% convinced my understanding is right, as there are quite
a few moving parts involved. Wierdly I managed to get things working
with existing docker 20.10.7 simply by using the seccomp profile from
docker git master, which is counter to my understanding described
above. The heuristics involved in runc for EPERM/ENOSYS are very
hard to understand and rationalize behaviour for :-(

To try to make it simpler I send a pull request to explicitly list
clone3 with ENOSYS, so that its not subject to the wierd heuristics
in runc

   https://github.com/moby/moby/pull/42681

> commit 54eff4354b17a9c460b851300f28aed1408a8615
> Author: Aleksa Sarai <asarai@suse.de>
> Date:   Sun Jan 17 23:39:31 2021 +1100
> 
>     profiles: seccomp: update to Linux 5.11 syscall list
>     
>     These syscalls (some of which have been in Linux for a while but were
>     missing from the profile) fall into a few buckets:
>     
>      * close_range(2), epoll_pwait2(2) are just extensions of existing "safe
>        for everyone" syscalls.
>     
>      * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are
>        all equivalent to aspects of mount(2) and thus go into the
>        CAP_SYS_ADMIN category.
>     
>      * process_madvise(2) is similar to the other process_*(2) syscalls and
>        thus goes in the CAP_SYS_PTRACE category.
>     
>     Signed-off-by: Aleksa Sarai <asarai@suse.de>
> 
> Maybe we don't see this everywhere because these higher system call
> numbers become available only if the system libseccomp version is recent
> enough to know about them.  Once that is the case, the ENOSYS/EPERM line
> shifts and clone3 is on the wrong side of it.
> 
> If that's indeed the explanation, then maybe we can simply fix moby and
> ask Microsoft to respin their images?

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-27 10:48         ` Szabolcs Nagy via Libc-alpha
@ 2021-07-29  8:56           ` Aleksa Sarai
  2021-07-29 10:50             ` Florian Weimer via Libc-alpha
  2021-07-29 11:38             ` Szabolcs Nagy via Libc-alpha
  0 siblings, 2 replies; 17+ messages in thread
From: Aleksa Sarai @ 2021-07-29  8:56 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Florian Weimer, Christian Brauner, Florian Weimer via Libc-alpha

[-- Attachment #1: Type: text/plain, Size: 2370 bytes --]

On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> The 07/27/2021 20:22, Aleksa Sarai wrote:
> > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > 
> > The way it works is that any syscall which has a larger syscall number
> > than any syscall specified in the filter will get -ENOSYS (this works
> > even if libseccomp is outdated). The only way you could get the -EPERM
> > behaviour with modern runc is if you write a seccomp profile that had
> > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > Docker doesn't do that. (The reason for this slightly convoluted
> > behaviour was to make sure that intentional omissions actually give you
> > -EPERM.)
> 
> this sounds broken. it really should return ENOSYS unless
> a user specifically asked for a different errno value for
> a syscall. EPERM is just wrong.

Yes, if I was designing it from scratch, that's what I would've done.

But there are already existing filters that are written assuming the
default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
existing profiles is not a workable solution.

Should we fix all existing profiles and then change the behaviour again?
Sure, but given we solved this problem in a period of time when people
were screaming about glibc being broken in containers, I hope you'll
excuse the fact that we didn't really have time to co-ordinate updating
every downstream runc user.

> we will see random breakage in the future depending on
> what unrelated but newer syscalls users added to their
> whitelist. who thought this was a good idea?

If you update your syscall profile without knowing what you're doing,
things will break. That will always be the case.

The plan is/was to eventually implement this by explicitly stating a
minimum kernel version (so that all syscalls missing in the profile that
were available in that kernel version get ENOSYS) but libseccomp doesn't
provide that information at the moment, and given that such a filter
would be more complicated than the one we have at the moment, that
behaviour probably belongs in libseccomp (there are several issues open
in the libseccomp repo describing this issue and possible solutions).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-29  8:56           ` Aleksa Sarai
@ 2021-07-29 10:50             ` Florian Weimer via Libc-alpha
  2021-07-30 12:16               ` Aleksa Sarai
  2021-07-29 11:38             ` Szabolcs Nagy via Libc-alpha
  1 sibling, 1 reply; 17+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-07-29 10:50 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Szabolcs Nagy, Christian Brauner, Florian Weimer via Libc-alpha

* Aleksa Sarai:

> If you update your syscall profile without knowing what you're doing,
> things will break. That will always be the case.

But with the current syscall number dependency, this is jusy *way* too
hard.  Who would think that adding close_range (#436) to the policy
would switch clone3 (#435) from ENOSYS to ENOPERM?

I realized that Github actually provides a way to report image bugs, so
I filed:

  Docker seccomp policy incompatible with glibc 2.34
  <https://github.com/actions/virtual-environments/issues/3812>

Thanks,
Florian


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-29  8:56           ` Aleksa Sarai
  2021-07-29 10:50             ` Florian Weimer via Libc-alpha
@ 2021-07-29 11:38             ` Szabolcs Nagy via Libc-alpha
  2021-07-30 15:08               ` Aleksa Sarai
  1 sibling, 1 reply; 17+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-07-29 11:38 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Florian Weimer, Christian Brauner, Florian Weimer via Libc-alpha

The 07/29/2021 18:56, Aleksa Sarai wrote:
> On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > The 07/27/2021 20:22, Aleksa Sarai wrote:
> > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > > 
> > > The way it works is that any syscall which has a larger syscall number
> > > than any syscall specified in the filter will get -ENOSYS (this works
> > > even if libseccomp is outdated). The only way you could get the -EPERM
> > > behaviour with modern runc is if you write a seccomp profile that had
> > > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > > Docker doesn't do that. (The reason for this slightly convoluted
> > > behaviour was to make sure that intentional omissions actually give you
> > > -EPERM.)
> > 
> > this sounds broken. it really should return ENOSYS unless
> > a user specifically asked for a different errno value for
> > a syscall. EPERM is just wrong.
> 
> Yes, if I was designing it from scratch, that's what I would've done.
> 
> But there are already existing filters that are written assuming the
> default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
> existing profiles is not a workable solution.
> 
> Should we fix all existing profiles and then change the behaviour again?
> Sure, but given we solved this problem in a period of time when people
> were screaming about glibc being broken in containers, I hope you'll
> excuse the fact that we didn't really have time to co-ordinate updating
> every downstream runc user.

i think this can be fixed backward compatibly by
returning EPERM for old syscalls.

> > we will see random breakage in the future depending on
> > what unrelated but newer syscalls users added to their
> > whitelist. who thought this was a good idea?
> 
> If you update your syscall profile without knowing what you're doing,
> things will break. That will always be the case.
> 
> The plan is/was to eventually implement this by explicitly stating a
> minimum kernel version (so that all syscalls missing in the profile that
> were available in that kernel version get ENOSYS) but libseccomp doesn't
> provide that information at the moment, and given that such a filter
> would be more complicated than the one we have at the moment, that
> behaviour probably belongs in libseccomp (there are several issues open
> in the libseccomp repo describing this issue and possible solutions).

i dont think you need to do anything complicated
with a fixed cut off, e.g.

  return nr < 403 ? EPERM : ENOSYS

or you can give an explicit list of syscalls that
should return EPERM for bw compat reasons and the
rest is ENOSYS.

(and there should be an easy way to opt-out of
the bw compat behaviour and always do ENOSYS)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-29 10:50             ` Florian Weimer via Libc-alpha
@ 2021-07-30 12:16               ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2021-07-30 12:16 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Szabolcs Nagy, Christian Brauner, Florian Weimer via Libc-alpha

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

On 2021-07-29, Florian Weimer <fweimer@redhat.com> wrote:
> * Aleksa Sarai:
> 
> > If you update your syscall profile without knowing what you're doing,
> > things will break. That will always be the case.
> 
> But with the current syscall number dependency, this is jusy *way* too
> hard.  Who would think that adding close_range (#436) to the policy
> would switch clone3 (#435) from ENOSYS to ENOPERM?

Yeah, I expected that the Docker folks would've been aware of this when
updating the profile (the maintainers were aware of the runc change at
the time) so it does seem this is a bit too complicated...

I think changing this to one of the older versions of the feature I had
(only EPERM for syscalls that were present in Linux 3.0) is probably
less likely to cause confusion, until we have the whole minimum kernel
version infrastructure I mentioned.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: RFC: Disable clone3 for glibc 2.34
  2021-07-29 11:38             ` Szabolcs Nagy via Libc-alpha
@ 2021-07-30 15:08               ` Aleksa Sarai
  0 siblings, 0 replies; 17+ messages in thread
From: Aleksa Sarai @ 2021-07-30 15:08 UTC (permalink / raw)
  To: Szabolcs Nagy
  Cc: Florian Weimer, Christian Brauner, Florian Weimer via Libc-alpha

[-- Attachment #1: Type: text/plain, Size: 5481 bytes --]

On 2021-07-29, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> The 07/29/2021 18:56, Aleksa Sarai wrote:
> > On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > > The 07/27/2021 20:22, Aleksa Sarai wrote:
> > > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > > > 
> > > > The way it works is that any syscall which has a larger syscall number
> > > > than any syscall specified in the filter will get -ENOSYS (this works
> > > > even if libseccomp is outdated). The only way you could get the -EPERM
> > > > behaviour with modern runc is if you write a seccomp profile that had
> > > > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > > > Docker doesn't do that. (The reason for this slightly convoluted
> > > > behaviour was to make sure that intentional omissions actually give you
> > > > -EPERM.)
> > > 
> > > this sounds broken. it really should return ENOSYS unless
> > > a user specifically asked for a different errno value for
> > > a syscall. EPERM is just wrong.
> > 
> > Yes, if I was designing it from scratch, that's what I would've done.
> > 
> > But there are already existing filters that are written assuming the
> > default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
> > existing profiles is not a workable solution.
> > 
> > Should we fix all existing profiles and then change the behaviour again?
> > Sure, but given we solved this problem in a period of time when people
> > were screaming about glibc being broken in containers, I hope you'll
> > excuse the fact that we didn't really have time to co-ordinate updating
> > every downstream runc user.
> 
> i think this can be fixed backward compatibly by
> returning EPERM for old syscalls.

I just remembered why this was a problem (I tried to implement exactly
this behaviour when I first worked on the patch) -- in runc we use
libseccomp to generate profiles, but libseccomp has several limitations
that mean we cannot implement *any* syscall-number-based fallback
behaviour (I tried every way I could think of for at least a week or
two).

The end result is that in runc we use libseccomp to generate the filter,
then output the BPF program, then patch it (add a program to the start
which does the syscall check and returns ENOSYS in the correct case) and
then we run the program. The only other option would be to basically
rewrite libseccomp for runc (which I did consider, but then thought
better of).

Now, having the fallback being a fixed syscall number seems like it
would be trivial -- but because we have to use libseccomp's filter
generatiion and patch it, patching the return value of the program to be
syscall-specific would require patching every return statement in the
generated BPF (and then possibly rewriting every jump depending on where
the returns are). I think in practice there's only two returns, but
hardcoding that is going to cause issues if libseccomp ever changes that
behaviour.

But I think you're right that this is probably less likely to cause
confusion. Unfortunately I'm not really sure that there's a
straightforward way to implement it, outside of implementing the
behaviour in libseccomp (or at the very least expanding libseccomp to
let us work around it).

> > > we will see random breakage in the future depending on
> > > what unrelated but newer syscalls users added to their
> > > whitelist. who thought this was a good idea?
> > 
> > If you update your syscall profile without knowing what you're doing,
> > things will break. That will always be the case.
> > 
> > The plan is/was to eventually implement this by explicitly stating a
> > minimum kernel version (so that all syscalls missing in the profile that
> > were available in that kernel version get ENOSYS) but libseccomp doesn't
> > provide that information at the moment, and given that such a filter
> > would be more complicated than the one we have at the moment, that
> > behaviour probably belongs in libseccomp (there are several issues open
> > in the libseccomp repo describing this issue and possible solutions).
> 
> i dont think you need to do anything complicated
> with a fixed cut off, e.g.
> 
>   return nr < 403 ? EPERM : ENOSYS

See my above point about how this is non-trivial.

> or you can give an explicit list of syscalls that
> should return EPERM for bw compat reasons and the
> rest is ENOSYS.

The issue with requiring explicit EPERM rules is that libseccomp doesn't
support certain sets of argument rule combinations, meaning that you
cannot generate nor require users to specify a set of inverse rules to
return EPERM for syscalls like clone() where only certain flags are
being blocked. (Another one of my attempts was to generate the set of
inverse rules, and use ENOSYS as the default.)

(Also some of the inverse rules you can technically generate -- most
notably the inverse of SCMP_MASKED_EQ requires 2^n rules to be generated
(where n is the number of bits in the mask).)

> (and there should be an easy way to opt-out of
> the bw compat behaviour and always do ENOSYS)

You can opt of out of it already, by setting the defaultErrnoRet to
ENOSYS. And someone has submitted a patch to Docker to do exactly
that[1].

[1]: https://github.com/moby/moby/pull/42649

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-07-30 15:09 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-27  8:43 RFC: Disable clone3 for glibc 2.34 Florian Weimer via Libc-alpha
2021-07-27  9:11 ` Florian Weimer via Libc-alpha
2021-07-27  9:24   ` Christian Brauner
2021-07-27  9:41     ` Christian Brauner
2021-07-27 10:22       ` Aleksa Sarai
2021-07-27 10:48         ` Szabolcs Nagy via Libc-alpha
2021-07-29  8:56           ` Aleksa Sarai
2021-07-29 10:50             ` Florian Weimer via Libc-alpha
2021-07-30 12:16               ` Aleksa Sarai
2021-07-29 11:38             ` Szabolcs Nagy via Libc-alpha
2021-07-30 15:08               ` Aleksa Sarai
2021-07-28 17:44         ` Florian Weimer via Libc-alpha
2021-07-29  8:36           ` Daniel P. Berrangé via Libc-alpha
2021-07-27 23:07 ` Andreas K. Huettel via Libc-alpha
2021-07-28  4:58   ` Florian Weimer via Libc-alpha
2021-07-28 17:22     ` [PATCH] Typo: Rename HAVE_CLONE3_WAPPER to HAVE_CLONE3_WRAPPER H.J. Lu via Libc-alpha
2021-07-28 17:35       ` Adhemerval Zanella via Libc-alpha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).