unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: Aleksa Sarai <cyphar@cyphar.com>
To: Szabolcs Nagy <szabolcs.nagy@arm.com>
Cc: Florian Weimer <fweimer@redhat.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Florian Weimer via Libc-alpha <libc-alpha@sourceware.org>
Subject: Re: RFC: Disable clone3 for glibc 2.34
Date: Sat, 31 Jul 2021 01:08:59 +1000	[thread overview]
Message-ID: <20210730150859.ebuzlvnva3ym7smq@senku> (raw)
In-Reply-To: <20210729113829.GD14854@arm.com>

[-- Attachment #1: Type: text/plain, Size: 5481 bytes --]

On 2021-07-29, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> The 07/29/2021 18:56, Aleksa Sarai wrote:
> > On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote:
> > > The 07/27/2021 20:22, Aleksa Sarai wrote:
> > > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > > > 
> > > > The way it works is that any syscall which has a larger syscall number
> > > > than any syscall specified in the filter will get -ENOSYS (this works
> > > > even if libseccomp is outdated). The only way you could get the -EPERM
> > > > behaviour with modern runc is if you write a seccomp profile that had
> > > > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > > > Docker doesn't do that. (The reason for this slightly convoluted
> > > > behaviour was to make sure that intentional omissions actually give you
> > > > -EPERM.)
> > > 
> > > this sounds broken. it really should return ENOSYS unless
> > > a user specifically asked for a different errno value for
> > > a syscall. EPERM is just wrong.
> > 
> > Yes, if I was designing it from scratch, that's what I would've done.
> > 
> > But there are already existing filters that are written assuming the
> > default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
> > existing profiles is not a workable solution.
> > 
> > Should we fix all existing profiles and then change the behaviour again?
> > Sure, but given we solved this problem in a period of time when people
> > were screaming about glibc being broken in containers, I hope you'll
> > excuse the fact that we didn't really have time to co-ordinate updating
> > every downstream runc user.
> 
> i think this can be fixed backward compatibly by
> returning EPERM for old syscalls.

I just remembered why this was a problem (I tried to implement exactly
this behaviour when I first worked on the patch) -- in runc we use
libseccomp to generate profiles, but libseccomp has several limitations
that mean we cannot implement *any* syscall-number-based fallback
behaviour (I tried every way I could think of for at least a week or
two).

The end result is that in runc we use libseccomp to generate the filter,
then output the BPF program, then patch it (add a program to the start
which does the syscall check and returns ENOSYS in the correct case) and
then we run the program. The only other option would be to basically
rewrite libseccomp for runc (which I did consider, but then thought
better of).

Now, having the fallback being a fixed syscall number seems like it
would be trivial -- but because we have to use libseccomp's filter
generatiion and patch it, patching the return value of the program to be
syscall-specific would require patching every return statement in the
generated BPF (and then possibly rewriting every jump depending on where
the returns are). I think in practice there's only two returns, but
hardcoding that is going to cause issues if libseccomp ever changes that
behaviour.

But I think you're right that this is probably less likely to cause
confusion. Unfortunately I'm not really sure that there's a
straightforward way to implement it, outside of implementing the
behaviour in libseccomp (or at the very least expanding libseccomp to
let us work around it).

> > > we will see random breakage in the future depending on
> > > what unrelated but newer syscalls users added to their
> > > whitelist. who thought this was a good idea?
> > 
> > If you update your syscall profile without knowing what you're doing,
> > things will break. That will always be the case.
> > 
> > The plan is/was to eventually implement this by explicitly stating a
> > minimum kernel version (so that all syscalls missing in the profile that
> > were available in that kernel version get ENOSYS) but libseccomp doesn't
> > provide that information at the moment, and given that such a filter
> > would be more complicated than the one we have at the moment, that
> > behaviour probably belongs in libseccomp (there are several issues open
> > in the libseccomp repo describing this issue and possible solutions).
> 
> i dont think you need to do anything complicated
> with a fixed cut off, e.g.
> 
>   return nr < 403 ? EPERM : ENOSYS

See my above point about how this is non-trivial.

> or you can give an explicit list of syscalls that
> should return EPERM for bw compat reasons and the
> rest is ENOSYS.

The issue with requiring explicit EPERM rules is that libseccomp doesn't
support certain sets of argument rule combinations, meaning that you
cannot generate nor require users to specify a set of inverse rules to
return EPERM for syscalls like clone() where only certain flags are
being blocked. (Another one of my attempts was to generate the set of
inverse rules, and use ENOSYS as the default.)

(Also some of the inverse rules you can technically generate -- most
notably the inverse of SCMP_MASKED_EQ requires 2^n rules to be generated
(where n is the number of bits in the mask).)

> (and there should be an easy way to opt-out of
> the bw compat behaviour and always do ENOSYS)

You can opt of out of it already, by setting the defaultErrnoRet to
ENOSYS. And someone has submitted a patch to Docker to do exactly
that[1].

[1]: https://github.com/moby/moby/pull/42649

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

  reply	other threads:[~2021-07-30 15:09 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-27  8:43 RFC: Disable clone3 for glibc 2.34 Florian Weimer via Libc-alpha
2021-07-27  9:11 ` Florian Weimer via Libc-alpha
2021-07-27  9:24   ` Christian Brauner
2021-07-27  9:41     ` Christian Brauner
2021-07-27 10:22       ` Aleksa Sarai
2021-07-27 10:48         ` Szabolcs Nagy via Libc-alpha
2021-07-29  8:56           ` Aleksa Sarai
2021-07-29 10:50             ` Florian Weimer via Libc-alpha
2021-07-30 12:16               ` Aleksa Sarai
2021-07-29 11:38             ` Szabolcs Nagy via Libc-alpha
2021-07-30 15:08               ` Aleksa Sarai [this message]
2021-07-28 17:44         ` Florian Weimer via Libc-alpha
2021-07-29  8:36           ` Daniel P. Berrangé via Libc-alpha
2021-07-27 23:07 ` Andreas K. Huettel via Libc-alpha
2021-07-28  4:58   ` Florian Weimer via Libc-alpha
2021-07-28 17:22     ` [PATCH] Typo: Rename HAVE_CLONE3_WAPPER to HAVE_CLONE3_WRAPPER H.J. Lu via Libc-alpha
2021-07-28 17:35       ` Adhemerval Zanella via Libc-alpha

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210730150859.ebuzlvnva3ym7smq@senku \
    --to=cyphar@cyphar.com \
    --cc=christian.brauner@ubuntu.com \
    --cc=fweimer@redhat.com \
    --cc=libc-alpha@sourceware.org \
    --cc=szabolcs.nagy@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).