On 2021-07-27, Szabolcs Nagy wrote:
> The 07/27/2021 20:22, Aleksa Sarai wrote:
> > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now.
> >
> > The way it works is that any syscall which has a larger syscall number
> > than any syscall specified in the filter will get -ENOSYS (this works
> > even if libseccomp is outdated). The only way you could get the -EPERM
> > behaviour with modern runc is if you write a seccomp profile that had
> > rules for newer syscalls (openat2 for instance) but not clone3 -- but
> > Docker doesn't do that. (The reason for this slightly convoluted
> > behaviour was to make sure that intentional omissions actually give you
> > -EPERM.)
>
> this sounds broken. it really should return ENOSYS unless
> a user specifically asked for a different errno value for
> a syscall. EPERM is just wrong.

Yes, if I was designing it from scratch, that's what I would've done. But
there are already existing filters that are written assuming the default
errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for existing
profiles is not a workable solution.

Should we fix all existing profiles and then change the behaviour again?
Sure, but given we solved this problem in a period of time when people
were screaming about glibc being broken in containers, I hope you'll
excuse the fact that we didn't really have time to co-ordinate updating
every downstream runc user.

> we will see random breakage in the future depending on
> what unrelated but newer syscalls users added to their
> whitelist. who thought this was a good idea?

If you update your syscall profile without knowing what you're doing,
things will break. That will always be the case.

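For illustration, the fallback rule quoted at the top of this mail can be
modelled roughly as below. This is only a sketch of the decision logic,
not runc's actual BPF generation code: the table, the ALLOW constant, the
function name and the two example profiles are all made up for the
example, profiles are simplified to plain allow-lists, and the syscall
numbers are the x86_64 ones.

```python
import errno

# x86_64 syscall numbers, used only to make the example concrete.
SYSCALL_NR = {
    "read": 0, "write": 1, "clone": 56, "unshare": 272,
    "statx": 332, "clone3": 435, "openat2": 437,
}

ALLOW = 0  # hypothetical stand-in for "the syscall is permitted"


def filter_result(profile, name, default_errno=errno.EPERM):
    """Model of the rule: what does syscall `name` get under `profile`?

    `profile` is the set of syscall names the profile has (allow) rules for.
    """
    if name in profile:
        return ALLOW
    # Largest syscall number the profile mentions at all.
    newest_known = max(SYSCALL_NR[s] for s in profile)
    if SYSCALL_NR[name] > newest_known:
        # Newer than anything the profile author wrote a rule for.
        return errno.ENOSYS
    # Older than the newest rule: treated as an intentional omission.
    return default_errno


# Hypothetical profiles: the second one only adds a rule for openat2.
old_profile = {"read", "write", "clone", "statx"}
new_profile = old_profile | {"openat2"}
```

In this model, `filter_result(old_profile, "clone3")` is ENOSYS and
`filter_result(old_profile, "unshare")` is EPERM, while
`filter_result(new_profile, "clone3")` becomes EPERM, which is the "rules
for openat2 but not clone3" case described in the quoted text.
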
The plan is/was to eventually implement this by explicitly stating a
minimum kernel version (so that all syscalls missing in the profile that
were available in that kernel version get ENOSYS) but libseccomp doesn't
provide that information at the moment, and given that such a filter
would be more complicated than the one we have at the moment, that
behaviour probably belongs in libseccomp (there are several issues open
in the libseccomp repo describing this issue and possible solutions).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH