On 2021-07-27, Christian Brauner wrote:
> On Tue, Jul 27, 2021 at 11:24:16AM +0200, Christian Brauner wrote:
> > On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote:
> > > * Florian Weimer via Libc-alpha:
> > >
> > > > Reportedly, the docker package in Ubuntu as used by Github Actions and
> > > > others does not provide a way to enable the clone3 system call. It
> > > > always fails with EPERM.
> > > >
> > > > Should we apply a patch like this for the release?
> > > >
> > > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c
> > > > index 1e7a8f6b35..4046c81180 100644
> > > > --- a/sysdeps/unix/sysv/linux/clone-internal.c
> > > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c
> > > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args,
> > > >                    int (*func) (void *arg), void *arg)
> > > >  {
> > > >    int ret;
> > > > -#ifdef HAVE_CLONE3_WAPPER
> > > > -  /* Try clone3 first. */
> > > > -  int saved_errno = errno;
> > > > -  ret = __clone3 (cl_args, sizeof (*cl_args), func, arg);
> > > > -  if (ret != -1 || errno != ENOSYS)
> > > > -    return ret;
> > > > -
> > > > -  /* NB: Restore errno since errno may be checked against non-zero
> > > > -     return value. */
> > > > -  __set_errno (saved_errno);
> > > > -#endif
> > > >
> > > >    /* Map clone3 arguments to clone arguments. NB: No need to check
> > > >       invalid clone3 specific bits in flags nor exit_signal since this
> > > >
> > > > My concern with this is that we don't know yet where the CET kernel API
> > > > will land exactly and if CET will require clone3. So clone3 might have
> > > > to come back once we turn on CET, which is hopefully soon.
> > >
> > > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the
> > > issue with its docker.io/containerd/runc packages.
> > >
> > > I could trivially fix a previously failing Github Action with:
> > >
> > > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml
> > > index d2381ec..7b10286 100644
> > > --- a/.github/workflows/fedora.yml
> > > +++ b/.github/workflows/fedora.yml
> > > @@ -22,6 +22,7 @@ jobs:
> > >      runs-on: ubuntu-latest
> > >      container:
> > >        image: fedora:${{matrix.release}}
> > > +      options: --security-opt seccomp=unconfined
> > >
> > >      steps:
> > >        - name: Checkout repository
> > >
> > > So I think we need to figure out what people are actually complaining
> > > about.
> >
> > This relates to the discussion what errno value should be used in a
> > seccomp filter to indicate that a syscall is blocked.
> >
> > So there are two problems I see with seccomp and clone3():
> > 1. the profile doesn't include clone3() at all and therefore the syscall
> >    is blocked and the default action is EPERM
> > 2. the profile does include clone3() and decided to block it but the
> >    runtime has decided to make seccomp return EPERM and not ENOSYS when
> >    clone3() is attempted
> >
> > The correct fix in both scenarios is to add clone3() to the seccomp
> > profile and either allow it or return ENOSYS.
> >
> > Note that this ENOSYS/EPERM problem is a general problem. Not just glibc
> > doesn't know when to fallback gracefully other tools don't know either.
> > Application container usually just get lucky because their applications
> > don't need to issue the syscalls that are blocked. On a generic system
> > container with systemd inside this is always an issue and not using
> > ENOSYS is guaranteed to fail across the board.
>
> Aleksa, this is fixed in runC, right?

Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
The way it works is that any syscall which has a larger syscall number
than any syscall specified in the filter will get -ENOSYS (this works
even if libseccomp is outdated).
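
To make that rule concrete, here is a minimal sketch in C. It is only an
illustration of the decision described above, not runc's actual code (the
real filter is a generated BPF program). The name filter_verdict and its
parameters are hypothetical, and every listed syscall is assumed to have
a plain "allow" rule.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: decide what a profile would do with syscall NR.
   LISTED holds the syscall numbers that have an explicit rule in the
   profile (all assumed to be "allow" rules for this sketch).
   Returns 0 if the syscall is allowed, otherwise the errno value the
   filter would report.  */
int
filter_verdict (const long *listed, size_t n_listed, long nr)
{
  long highest = -1;

  for (size_t i = 0; i < n_listed; i++)
    {
      if (listed[i] == nr)
        return 0;               /* Explicit rule found: allowed.  */
      if (listed[i] > highest)
        highest = listed[i];
    }

  /* Newer than anything the profile mentions: treat the syscall as
     unknown, so callers can fall back gracefully.  */
  if (nr > highest)
    return ENOSYS;

  /* Older than some listed syscall but not listed itself: treat the
     omission as intentional.  */
  return EPERM;
}
```

For example, using x86_64 syscall numbers, a profile that lists only read
(0), write (1) and clone (56) gives ENOSYS for clone3 (435), while a
profile that additionally lists openat2 (437) but not clone3 gives EPERM
for clone3.
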
The only way you could get the -EPERM behaviour with modern runc is if
you write a seccomp profile that has rules for newer syscalls (openat2
for instance) but not clone3 -- but Docker doesn't do that. (The reason
for this slightly convoluted behaviour was to make sure that intentional
omissions actually give you -EPERM.)

However, this requires the container host to have an updated version of
runc, which is up to GitHub. (Though we fixed a security issue in runc
recently, so I would expect that they've updated their versions of runc
by now.)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH