Date: Sat, 31 Jul 2021 01:08:59 +1000
From: Aleksa Sarai
To: Szabolcs Nagy
Cc: Florian Weimer, Christian Brauner, Florian Weimer via Libc-alpha
Subject: Re: RFC: Disable clone3 for glibc 2.34

On 2021-07-29, Szabolcs Nagy wrote:
> The 07/29/2021 18:56, Aleksa Sarai wrote:
> > On 2021-07-27, Szabolcs Nagy wrote:
> > > The 07/27/2021 20:22, Aleksa Sarai wrote:
> > > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> > > >
> > > > The way it works is that any syscall which has a larger syscall number
> > > > than any syscall specified in the filter will get -ENOSYS (this works
> > > > even if libseccomp is outdated). The only way you could get the -EPERM
> > > > behaviour with modern runc is if you write a seccomp profile that had
> > > > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > > > Docker doesn't do that.
> > > > (The reason for this slightly convoluted
> > > > behaviour was to make sure that intentional omissions actually give you
> > > > -EPERM.)
> > >
> > > this sounds broken. it really should return ENOSYS unless
> > > a user specifically asked for a different errno value for
> > > a syscall. EPERM is just wrong.
> >
> > Yes, if I was designing it from scratch, that's what I would've done.
> >
> > But there are already existing filters that are written assuming the
> > default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for
> > existing profiles is not a workable solution.
> >
> > Should we fix all existing profiles and then change the behaviour again?
> > Sure, but given we solved this problem in a period of time when people
> > were screaming about glibc being broken in containers, I hope you'll
> > excuse the fact that we didn't really have time to co-ordinate updating
> > every downstream runc user.
>
> i think this can be fixed backward compatibly by
> returning EPERM for old syscalls.

I just remembered why this was a problem (I tried to implement exactly
this behaviour when I first worked on the patch) -- in runc we use
libseccomp to generate profiles, but libseccomp has several limitations
that mean we cannot implement *any* syscall-number-based fallback
behaviour (I tried every way I could think of for at least a week or
two).

The end result is that in runc we use libseccomp to generate the filter,
then output the BPF program, then patch it (add a program to the start
which does the syscall check and returns ENOSYS in the correct case) and
then we run the program. The only other option would be to basically
rewrite libseccomp for runc (which I did consider, but then thought
better of).
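To make the fallback rule concrete, here is a rough model of the decision
logic in Python. It is only a sketch: the real thing is a BPF prefix
patched onto the libseccomp output, the names (SYSCALL_NR, stub_verdict,
run_filter) are hypothetical, and the syscall numbers are x86-64 values
assumed for the example.

```python
import errno

# Assumed x86-64 syscall numbers, listed only for illustration.
SYSCALL_NR = {"read": 0, "write": 1, "clone": 56, "unshare": 272,
              "clone3": 435, "openat2": 437}


def stub_verdict(nr, profile):
    """Prefix check: a syscall numbered above everything in the profile gets -ENOSYS."""
    highest_known = max(SYSCALL_NR[name] for name in profile)
    if nr > highest_known:
        return -errno.ENOSYS
    return None  # fall through to the (modelled) libseccomp-generated filter


def run_filter(name, profile, default_errno=errno.EPERM):
    """Model of the patched filter: prefix check first, then allow-list plus default errno."""
    nr = SYSCALL_NR[name]
    verdict = stub_verdict(nr, profile)
    if verdict is not None:
        return verdict
    return 0 if name in profile else -default_errno


# Profile that intentionally omits clone and knows nothing newer than unshare.
profile_a = {"read", "write", "unshare"}
# Same profile, but with a rule for the newer openat2 added.
profile_b = profile_a | {"openat2"}

print(run_filter("clone", profile_a) == -errno.EPERM)    # True: intentional omission
print(run_filter("clone3", profile_a) == -errno.ENOSYS)  # True: newer than the profile
print(run_filter("clone3", profile_b) == -errno.EPERM)   # True: openat2 raised the cut-off
```

The last line is the case discussed in the quoted text: once a profile
mentions openat2 but not clone3, clone3 falls below the highest known
syscall number and goes back to the default errno.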
Now, having the fallback being a fixed syscall number seems like it
would be trivial -- but because we have to use libseccomp's filter
generation and patch it, patching the return value of the program to be
syscall-specific would require patching every return statement in the
generated BPF (and then possibly rewriting every jump depending on where
the returns are). I think in practice there are only two returns, but
hardcoding that is going to cause issues if libseccomp ever changes that
behaviour.

But I think you're right that this is probably less likely to cause
confusion. Unfortunately I'm not really sure that there's a
straightforward way to implement it, outside of implementing the
behaviour in libseccomp (or at the very least expanding libseccomp to
let us work around it).

> > > we will see random breakage in the future depending on
> > > what unrelated but newer syscalls users added to their
> > > whitelist. who thought this was a good idea?
> >
> > If you update your syscall profile without knowing what you're doing,
> > things will break. That will always be the case.
> >
> > The plan is/was to eventually implement this by explicitly stating a
> > minimum kernel version (so that all syscalls missing in the profile that
> > were available in that kernel version get ENOSYS) but libseccomp doesn't
> > provide that information at the moment, and given that such a filter
> > would be more complicated than the one we have at the moment, that
> > behaviour probably belongs in libseccomp (there are several issues open
> > in the libseccomp repo describing this issue and possible solutions).
>
> i dont think you need to do anything complicated
> with a fixed cut off, e.g.
>
> return nr < 403 ? EPERM : ENOSYS

See my above point about how this is non-trivial.

> or you can give an explicit list of syscalls that
> should return EPERM for bw compat reasons and the
> rest is ENOSYS.
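Taken on their own, the two fallback rules suggested in the quoted text
are simple decision functions. A minimal Python sketch follows; the
function names are hypothetical, and 56 (clone) and 435 (clone3) are
x86-64 syscall numbers assumed for the example. It only models which
errno gets picked, not how that choice would be wired into a
libseccomp-generated BPF program.

```python
import errno

CUTOFF_NR = 403  # cut-off taken from the quoted one-liner


def fallback_by_cutoff(nr, cutoff=CUTOFF_NR):
    """Fixed cut-off: syscalls numbered below the cut-off get EPERM, the rest ENOSYS."""
    return errno.EPERM if nr < cutoff else errno.ENOSYS


def fallback_by_list(nr, eperm_syscalls):
    """Explicit list: only the listed syscall numbers get EPERM, everything else ENOSYS."""
    return errno.EPERM if nr in eperm_syscalls else errno.ENOSYS


CLONE, CLONE3 = 56, 435  # assumed x86-64 numbers

print(fallback_by_cutoff(CLONE) == errno.EPERM)         # True
print(fallback_by_cutoff(CLONE3) == errno.ENOSYS)       # True
print(fallback_by_list(CLONE3, {CLONE}) == errno.ENOSYS)  # True
```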
The issue with requiring explicit EPERM rules is that libseccomp doesn't
support certain sets of argument rule combinations, meaning that you can
neither generate nor require users to specify a set of inverse rules to
return EPERM for syscalls like clone() where only certain flags are
being blocked. (Another one of my attempts was to generate the set of
inverse rules, and use ENOSYS as the default.)

(Also, some of the inverse rules can technically be generated but are
impractical -- most notably, the inverse of SCMP_MASKED_EQ requires 2^n
rules to be generated (where n is the number of bits in the mask).)

> (and there should be an easy way to opt-out of
> the bw compat behaviour and always do ENOSYS)

You can opt out of it already, by setting the defaultErrnoRet to ENOSYS.
And someone has submitted a patch to Docker to do exactly that[1].

[1]: https://github.com/moby/moby/pull/42649

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH