Hardening Kubernetes with seccomp
In this post I will attempt to demystify the relationship between seccomp and Kubernetes. This first part looks at containers and pods.
This post is part of the K8S Security Lab series.
Kubernetes v1.22 introduces an alpha feature that makes it possible to use RuntimeDefault as the default seccomp profile instead of Unconfined. By default, Kubernetes creates every new container with the Unconfined seccomp profile, which means that seccomp filtering is disabled.
What is a seccomp profile?
Seccomp (Secure Computing) is a feature of the Linux kernel. It allows you to create profiles that filter system calls. Using seccomp profiles on containers reduces the chance that a Linux kernel vulnerability will be exploited. All container runtimes ship with a default seccomp profile. The problem arises when we use Kubernetes, because Kubernetes uses Unconfined as its default, which disables seccomp filtering.
For example, Docker's default seccomp profile disables approximately 44 of the 300+ system calls currently available.
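A quick way to see which seccomp mode applies to a running process is the Seccomp field in /proc/<pid>/status, where 0 means disabled, 1 means strict mode and 2 means filter mode. The following is a minimal Python sketch that parses that field. The function name seccomp_mode is made up for illustration, and the sketch assumes a kernel that exposes the field.

```python
from pathlib import Path

# Values the kernel reports in the "Seccomp:" line of /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" on newer kernels does not match this prefix.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return "not reported"


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # only present on Linux
        print("Seccomp mode:", seccomp_mode(status.read_text()))
```

Run inside a container, this should report the same state that an inspection tool shows as "disabled" or "filtering".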
Test the seccomp profile
For the test I will use amicontained, a container introspection tool. First, test in a plain Docker container.
docker run --rm -it r.j3ss.co/amicontained bash
Container Runtime: docker
Has Namespaces:
pid: true
user: false
AppArmor Profile: docker-default (enforce)
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked Syscalls (60):
SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
Looking for Docker.sock
As you can see, with the default Docker seccomp profile 60 syscalls are blocked. Now test with the default Kubernetes config on Docker.
kubectl run -it bash --image=r.j3ss.co/amicontained --restart=Never bash
Container Runtime: docker
Has Namespaces:
pid: true
user: false
AppArmor Profile: unconfined
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: disabled
Blocked Syscalls (21):
MSGRCV SYSLOG SETSID VHANGUP PIVOT_ROOT ACCT SETTIMEOFDAY SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE KEXEC_LOAD FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE KEXEC_FILE_LOAD BPF
Looking for Docker.sock
In the output above you can see that seccomp is disabled and that only 21 syscalls are blocked. Now test with the default Kubernetes config on rke2 (containerd).
kubectl run -it bash --image=r.j3ss.co/amicontained --restart=Never bash
Container Runtime: kube
Has Namespaces:
pid: true
user: false
AppArmor Profile: system_u:system_r:container_t:s0:c575,c847
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: disabled
Blocked Syscalls (22):
SYSLOG SETPGID SETSID VHANGUP PIVOT_ROOT ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE KEXEC_LOAD FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE KEXEC_FILE_LOAD BPF
Looking for Docker.sock
pod default/bash terminated (Error)
The containerd result is similar to Docker's, so let's test with CRI-O.
kubectl run -it bash --image=bash --restart=Never bash
# apk add curl
# curl -LO k8s.work/amicontained
# chmod +x amicontained
# ./amicontained
Container Runtime: kube
Has Namespaces:
pid: true
user: false
AppArmor Profile: system_u:system_r:spc_t:s0
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service
Seccomp: disabled
Blocked Syscalls (22):
MSGRCV SYSLOG SETSID VHANGUP PIVOT_ROOT ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE KEXEC_LOAD FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE KEXEC_FILE_LOAD BPF
Looking for Docker.sock
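To see what the runtime's seccomp profile actually adds, the blocked-syscall lists can be compared as sets. The following is a minimal Python sketch that uses the lists from the plain Docker run and the Kubernetes-on-Docker run above. The variable names are chosen for illustration. The calls that stay blocked with seccomp disabled are presumably refused for other reasons, such as dropped capabilities.

```python
# Blocked syscalls reported by amicontained in the plain Docker run.
DOCKER_DEFAULT = """
SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT
SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME
IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS
QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY
LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY
KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES
PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS
PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF
USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
""".split()

# Blocked syscalls reported in the Kubernetes-on-Docker run (seccomp disabled).
K8S_UNCONFINED = """
MSGRCV SYSLOG SETSID VHANGUP PIVOT_ROOT ACCT SETTIMEOFDAY SWAPON SWAPOFF
REBOOT SETHOSTNAME SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE
KEXEC_LOAD FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE KEXEC_FILE_LOAD BPF
""".split()

# Syscalls that are blocked only when the Docker default profile is active.
only_with_profile = sorted(set(DOCKER_DEFAULT) - set(K8S_UNCONFINED))
# Syscalls that show up as blocked only in the unconfined pod.
only_unconfined = sorted(set(K8S_UNCONFINED) - set(DOCKER_DEFAULT))

if __name__ == "__main__":
    print(len(only_with_profile), "syscalls are blocked only with the profile:")
    print(" ".join(only_with_profile))
    print("Blocked only in the unconfined pod:", " ".join(only_unconfined))
```

The first set is the part of the attack surface that an unconfined pod leaves open compared with the same container under the runtime's default profile.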
Enable RuntimeDefault seccomp profile
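Enable it in the local kubelet config file, then restart the kubelet:
nano /var/lib/kubelet/config.yaml
...
# equivalent kubelet flags: --feature-gates=SeccompDefault=true --seccomp-default
featureGates:
  SeccompDefault: true
seccompDefault: true
systemctl restart kubelet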
Enable it in the kubelet config stored in the cluster:
kubectl edit cm kubelet-config-1.22 -n kube-system
...
featureGates:
  SeccompDefault: true
seccompDefault: true
Then test:
kubectl run -it bash --image=r.j3ss.co/amicontained --restart=Never bash
Container Runtime: docker
Has Namespaces:
pid: true
user: false
AppArmor Profile: unconfined
Capabilities:
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked Syscalls (61):
MSGRCV PTRACE SYSLOG SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
Looking for Docker.sock
Customizing a Profile
One way to write a seccomp filter is to use the Berkeley Packet Filter (BPF) language directly, but that is neither simple nor convenient. Instead, we can write a JSON profile that libseccomp compiles into a filter.
If you want to create a profile that allows a container to ping a website, you can use the strace command to find the syscalls it makes:
strace -fqc ping -c 20 www.google.com
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
29.55 0.000078 4 20 11 openat
14.02 0.000037 9 4 4 socket
11.74 0.000031 3 12 mprotect
6.06 0.000016 2 7 read
5.68 0.000015 1 17 mmap
5.68 0.000015 3 5 capget
4.92 0.000013 13 1 munmap
3.79 0.000010 1 9 fstat
3.41 0.000009 9 1 write
3.41 0.000009 1 9 close
2.65 0.000007 2 3 brk
2.65 0.000007 4 2 prctl
2.27 0.000006 3 2 getuid
1.52 0.000004 4 1 setuid
1.52 0.000004 4 1 capset
1.14 0.000003 3 1 geteuid
0.00 0.000000 0 9 9 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 3 fcntl
0.00 0.000000 0 1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.000264 109 24 total
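The summary table can be turned into a profile mechanically. The following is a minimal Python sketch that extracts the syscall names from `strace -c` output and wraps them in an allow-list profile. The function names are made up for illustration, and the defaultAction and architectures are the same ones used in the sample profile in this post. Treat the result as a starting point only: the container runtime may need further syscalls to start the process.

```python
import json


def syscalls_from_strace_summary(text):
    """Collect the syscall names from the summary table printed by strace -c."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Data rows have at least five columns; the "total" row is skipped.
        if len(fields) < 5 or fields[-1] == "total":
            continue
        try:
            float(fields[0])  # data rows start with a percentage, e.g. 29.55
        except ValueError:
            continue  # header or separator line
        names.add(fields[-1])
    return sorted(names)


def build_profile(names):
    """Wrap syscall names in a profile that denies everything not listed."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
        "syscalls": [{"names": list(names), "action": "SCMP_ACT_ALLOW"}],
    }


# A few rows copied from the strace output above.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 29.55    0.000078           4        20        11 openat
 14.02    0.000037           9         4         4 socket
 11.74    0.000031           3        12           mprotect
------ ----------- ----------- --------- --------- ----------------
100.00    0.000264                   109        24 total
"""

if __name__ == "__main__":
    print(json.dumps(build_profile(syscalls_from_strace_summary(SAMPLE)), indent=2))
```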
Another solution is a tool called zaz, created by Paulo Gomes, that generates a seccomp profile for you with the minimum set of system calls:
zaz seccomp docker alpine "ping -c5 8.8.8.8"
A basic seccomp profile has three key elements: the defaultAction, the architectures (or archMap) and the syscalls:
mkdir /var/lib/kubelet/seccomp
nano /var/lib/kubelet/seccomp/sample.json
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"names": [
"arch_prctl",
"sched_yield",
"futex",
"write",
"mmap",
"exit_group",
"madvise",
"rt_sigprocmask",
"getpid",
"gettid",
"tgkill",
"rt_sigaction",
"read",
"getpgrp"
],
"action": "SCMP_ACT_ALLOW",
"args": [],
"comment": "",
"includes": {},
"excludes": {}
}
]
}
The defaultAction is SCMP_ACT_ERRNO, which blocks the execution of any system call that is not matched by a rule. Then we list the syscalls that we want to allow.
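The interplay between the rules and defaultAction can be modelled in a few lines. The following is a simplified Python sketch, not how libseccomp evaluates a filter: it ignores args and architectures and simply returns the action of the first rule that names the syscall. The function name action_for is made up for illustration, and the profile is a shortened copy of sample.json.

```python
import json

# Shortened copy of the sample profile above.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["arch_prctl", "futex", "write", "mmap", "exit_group", "read"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
"""


def action_for(profile, syscall):
    """Return the action a profile assigns to a syscall name (simplified model)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule matched, so the default applies.
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)

if __name__ == "__main__":
    for name in ("write", "mount"):
        print(name, "->", action_for(profile, name))
```

Here write is allowed because it is listed, while mount falls through to the default action and is refused.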
The different types of actions
Below is a list of all the different types of actions and what they do:
SCMP_ACT_KILL_THREAD (or SCMP_ACT_KILL)
Does not execute the syscall and terminates the thread that attempted the call. Note that, depending on the application being confined (e.g. whether it is multi-threaded) and its error handling, syscalls blocked with this action may fail silently, which can cause side effects in the overall application.
SCMP_ACT_TRAP
Does not execute the syscall. The kernel will send a thread-directed SIGSYS signal to the thread that attempted making the call.
SCMP_ACT_ERRNO
Does not execute the syscall and returns an error instead. Note that, depending on the error handling of the application being confined, syscalls blocked with this action may fail silently, which can cause side effects in the overall application.
SCMP_ACT_TRACE
The decision on whether or not to execute the syscall comes from a tracer. If no tracer is present, it behaves like SECCOMP_RET_ERRNO.
This can be used to automate profile generation and also to change the syscall being made. It is not recommended when enforcing seccomp on line-of-business applications.
SCMP_ACT_ALLOW
Executes the syscall.
SCMP_ACT_LOG (since Linux 4.14)
Executes the syscall. Useful for running seccomp in "complain-mode", logging the syscalls that are mapped (or catch-all) and not blocking their execution. It can be used together with other action types to provide an allow and deny list approach.
SCMP_ACT_KILL_PROCESS (since Linux 4.14)
Does not execute the syscall and terminates the entire process with a core dump. Very useful when automating the profile generation.
Configure on a pod
Since Kubernetes v1.19, setting a seccomp profile for a pod or container is GA. The syntax looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: some-pod
  labels:
    app: some-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: sample.json
  containers:
  ...
Valid options for type include RuntimeDefault, Unconfined, and Localhost. Here is an example that sets the seccomp profile to the node's container runtime default profile:
apiVersion: v1
kind: Pod
metadata:
  name: some-pod
  labels:
    app: some-pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  ...
For this pod, the configuration has the same effect as setting SeccompDefault=true in the kubelet config.
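The rules for the seccompProfile block are easy to get wrong by hand: localhostProfile is required for type Localhost and not allowed for the other two types. The following is a minimal Python sketch that builds the pod manifest as JSON and checks those rules first. The function names and the bash image are chosen for illustration only.

```python
import json

VALID_TYPES = ("RuntimeDefault", "Unconfined", "Localhost")


def seccomp_profile(profile_type, localhost_profile=None):
    """Build the seccompProfile block, rejecting invalid combinations."""
    if profile_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}")
    if profile_type == "Localhost":
        if not localhost_profile:
            raise ValueError("type Localhost requires localhostProfile")
        return {"type": "Localhost", "localhostProfile": localhost_profile}
    if localhost_profile is not None:
        raise ValueError("localhostProfile is only valid with type Localhost")
    return {"type": profile_type}


def pod_manifest(name, image, profile):
    """Return a Pod manifest with a pod-level seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "securityContext": {"seccompProfile": profile},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    profile = seccomp_profile("Localhost", "sample.json")
    print(json.dumps(pod_manifest("some-pod", "bash", profile), indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest can be piped straight into kubectl apply -f -.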
Add audit profile
Since Linux kernel 4.14 it is possible to run parts of your profile in audit mode, logging the system calls you choose to syslog without blocking them. To do that, use the action SCMP_ACT_LOG:
nano /var/lib/kubelet/seccomp/audit.json
{
"defaultAction": "SCMP_ACT_LOG",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"names": [
"arch_prctl",
"sched_yield",
"futex",
"write",
"mmap",
"exit_group",
"madvise",
"rt_sigprocmask",
"getpid",
"gettid",
"tgkill",
"rt_sigaction",
"read",
"getpgrp"
],
"action": "SCMP_ACT_ALLOW"
},
{
"names": [
"add_key",
"keyctl",
"ptrace"
],
"action": "SCMP_ACT_ERRNO"
}
]
}
~ $ tail /var/log/syslog
Nov 25 19:38:18 kernel: [461698.749294] audit: ... syscall=21 compat=0 ip=0x7ff8f8412d5b code=0x7ffc0000 # access
Nov 25 19:38:18 kernel: [461698.749306] audit: ... syscall=257 compat=0 ip=0x7ff8f8412ec8 code=0x7ffc0000 # openat
Nov 25 19:38:18 kernel: [461698.749315] audit: ... syscall=5 compat=0 ip=0x7ff8f8412c99 code=0x7ffc0000 # fstat
Nov 25 19:38:18 kernel: [461698.749317] audit: ... syscall=9 compat=0 ip=0x7ff8f84130e6 code=0x7ffc0000 # mmap
Nov 25 19:38:18 kernel: [461698.749323] audit: ... syscall=3 compat=0 ip=0x7ff8f8412d8b code=0x7ffc0000 # close
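The audit lines only carry syscall numbers, so a small script helps when reading longer logs. The following is a minimal Python sketch that counts the logged syscalls by name. The number-to-name table is deliberately partial: it covers only the x86_64 numbers annotated in the excerpt above, and anything else is reported as unknown. The function name logged_syscalls is made up for illustration.

```python
import re
from collections import Counter

# Partial x86_64 table, limited to the numbers that appear in the excerpt above.
SYSCALL_NAMES = {3: "close", 5: "fstat", 9: "mmap", 21: "access", 257: "openat"}

SYSCALL_RE = re.compile(r"\bsyscall=(\d+)\b")


def logged_syscalls(log_text):
    """Count the syscalls named in seccomp audit lines, keyed by syscall name."""
    counts = Counter()
    for match in SYSCALL_RE.finditer(log_text):
        number = int(match.group(1))
        counts[SYSCALL_NAMES.get(number, f"unknown_{number}")] += 1
    return counts


# Lines in the same shape as the syslog excerpt above.
SAMPLE_LOG = """\
Nov 25 19:38:18 kernel: [461698.749294] audit: ... syscall=21 compat=0 ip=0x7ff8f8412d5b code=0x7ffc0000
Nov 25 19:38:18 kernel: [461698.749306] audit: ... syscall=257 compat=0 ip=0x7ff8f8412ec8 code=0x7ffc0000
Nov 25 19:38:18 kernel: [461698.749323] audit: ... syscall=3 compat=0 ip=0x7ff8f8412d8b code=0x7ffc0000
"""

if __name__ == "__main__":
    for name, count in sorted(logged_syscalls(SAMPLE_LOG).items()):
        print(f"{name}: {count}")
```

The names collected this way are candidates for the allow list once the audit run has covered the normal behaviour of the application.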
Set capabilities for a Container
With Kubernetes 1.22 you should be able to set the net.ipv4.ip_unprivileged_port_start sysctl in the security context of your pod manifests, allowing containers that run as unprivileged users to bind to low ports.
securityContext:
  sysctls:
  - name: net.ipv4.ip_unprivileged_port_start
    value: "1"
Final Words
Whatever you define in your seccomp profile, the kernel will enforce it, even if that is not what you want. For example, if you block calls such as exit or exit_group, your container may not be able to exit and can get stuck in an exit loop indefinitely, leading to high CPU usage in your cluster.
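A profile can be checked for this particular trap before it is deployed. The following is a minimal Python sketch that reports which of exit and exit_group a profile would not execute. It uses a simplified first-match model that ignores args and architectures, and the function names are made up for illustration.

```python
# Exit-related syscalls named in the paragraph above.
EXIT_CALLS = ("exit", "exit_group")
# Actions under which the kernel still executes the syscall.
EXECUTING_ACTIONS = ("SCMP_ACT_ALLOW", "SCMP_ACT_LOG")


def action_for(profile, syscall):
    """Return the action of the first rule naming the syscall, else the default."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


def blocked_exit_calls(profile):
    """List the exit-related syscalls that the profile would not execute."""
    return [name for name in EXIT_CALLS
            if action_for(profile, name) not in EXECUTING_ACTIONS]


# The syscall names from sample.json earlier in this post.
SAMPLE_NAMES = [
    "arch_prctl", "sched_yield", "futex", "write", "mmap", "exit_group",
    "madvise", "rt_sigprocmask", "getpid", "gettid", "tgkill",
    "rt_sigaction", "read", "getpgrp",
]
sample = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": SAMPLE_NAMES, "action": "SCMP_ACT_ALLOW"}],
}

if __name__ == "__main__":
    print("Exit calls that would be blocked:", blocked_exit_calls(sample))
```

Run against the names in sample.json, the check reports exit, since only exit_group is on the allow list there. Whether that matters depends on how the workload terminates its threads.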