Containers vs. Pods - Deep Dive

In this post we will take a look at the difference between containers and pods.

With the wide usage of Docker/OCI containers, they have become a replacement for VMs. This approach follows microservices best practices, which means running one service per container. That can be a problem in some situations and can block a move from VMs to containers. While there are a few workarounds for running multiple services in one container, Kubernetes has a solution in the form of ‘Pods’.

A Pod is the smallest deployable unit in Kubernetes: a group of one or more containers with shared resources. Every pod gets a unique IP address (more on this in the networking post). This means two containers in the same pod cannot listen on the same port, because they share one network namespace. Every container in a pod gets an isolated filesystem, and from inside one container you don’t see the processes running in the other containers of the same pod. But containers in one pod can communicate via shared memory!

Containers do not necessarily mean Docker. There are other container technologies, such as LXC and OpenVZ. These technologies differ in the type of isolation they provide for the processes running in the container. As discussed in a previous post, Docker/OCI containers are based on the OCI Runtime Spec.

Namespace Isolation

Let’s see what isolation primitives were created when I started the container:

# Look up the container in the process tree.
$ ps auxf
USER       PID  ...  COMMAND
...
root      4707       /usr/bin/containerd-shim-runc-v2 -namespace moby -id cc9466b3e...
root      4727        \_ nginx: master process nginx -g daemon off;
systemd+  4781            \_ nginx: worker process
systemd+  4782            \_ nginx: worker process

# Find the namespaces used by 4727 process.
$ sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
...
4026532157 mnt         3  4727 root    nginx: master process nginx -g daemon off;
4026532158 uts         3  4727 root    nginx: master process nginx -g daemon off;
4026532159 ipc         3  4727 root    nginx: master process nginx -g daemon off;
4026532160 pid         3  4727 root    nginx: master process nginx -g daemon off;
4026532162 net         3  4727 root    nginx: master process nginx -g daemon off;

As you can see, Docker uses multiple namespaces to isolate the containers:

  • mnt - isolated mount table
  • uts - the container has its own hostname and domain name
  • ipc, pid - processes can only see and communicate with other processes inside the same container
  • net - the container gets its own network stack

The user ID namespace is not used by default, but you can run Docker in rootless mode to get user isolation.
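
To look at a single process instead of the whole lsns table, the links under /proc/<PID>/ns can be read directly. Below is a minimal sketch; the function name list_ns is my own, and it assumes a Linux /proc filesystem and a readlink binary on the node:

```shell
# list_ns DIR: print "<name> <inode>" for every namespace link in DIR.
# DIR defaults to /proc/self/ns, i.e. the namespaces of the calling shell.
list_ns() (
  dir="${1:-/proc/self/ns}"
  for link in "$dir"/*; do
    # Each entry is a symlink whose target looks like 'net:[4026532162]'.
    target=$(readlink "$link") || continue
    name=${link##*/}          # link name, e.g. "net"
    inode=${target#*:\[}      # strip everything up to and including ":["
    inode=${inode%\]}         # strip the trailing "]"
    printf '%s %s\n' "$name" "$inode"
  done
)
```

For the nginx container above this would be called as list_ns /proc/4727/ns from a root shell, since reading another user's namespace links usually needs elevated privileges.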

There is one other major isolation primitive, and that is cgroups (control groups). Cgroups can be used to stop hungry processes from accidentally consuming all of the host’s resources.

Check Cgroup limits

We can examine the corresponding subtree in the cgroup virtual filesystem. cgroupfs is mounted at /sys/fs/cgroup, and the cgroups of a given process are listed in /proc/<PID>/cgroup.
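
Each line of /proc/<PID>/cgroup has the form hierarchy-ID:controller-list:path, so the path for one controller can be picked out with a little awk. This is only a sketch; cgroup_path is a hypothetical helper name, and it assumes the cgroup v1 layout with named controllers:

```shell
# cgroup_path CONTROLLER: read /proc/<PID>/cgroup-style lines on stdin and
# print the cgroup path of the hierarchy that includes CONTROLLER.
cgroup_path() {
  awk -F: -v want="$1" '
    {
      # Field 2 may list several controllers, e.g. "cpu,cpuacct".
      n = split($2, ctrl, ",")
      for (i = 1; i <= n; i++)
        if (ctrl[i] == want) {
          path = $0
          sub(/^[^:]*:[^:]*:/, "", path)   # keep everything after the 2nd colon
          print path
          exit
        }
    }'
}
```

A typical call would be cgroup_path memory < /proc/${PID}/cgroup, and the printed path is relative to the controller's mount point under /sys/fs/cgroup.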

Let’s run a container for testing:

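docker run --name foo --rm -d --memory='512MB' --cpus='0.5' nginx
# Get the PID of the container's main process
PID=$(docker inspect --format '{{.State.Pid}}' foo)
# Get the full container ID (used as ${ID} in the cgroup paths below)
ID=$(docker inspect --format '{{.Id}}' foo)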

# Check cgroupfs node for the container main process (4727).
$ cat /proc/${PID}/cgroup
11:freezer:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
10:blkio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
9:rdma:/
8:pids:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
7:devices:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
6:cpuset:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
5:cpu,cpuacct:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
4:memory:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
3:net_cls,net_prio:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
2:perf_event:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
1:name=systemd:/docker/cc9466b3eb67ca374c925794776aad2fd45a34343ab66097a44594b35183dba0
0::/system.slice/containerd.service

# Check the memory limit.
$ cat /sys/fs/cgroup/memory/docker/${ID}/memory.limit_in_bytes
536870912  # Yay! It's the 512MB we requested!

# See the CPU limits.
ls /sys/fs/cgroup/cpu/docker/${ID}
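
Among the files in that directory, cpu.cfs_quota_us and cpu.cfs_period_us together express the CPU limit: the quota divided by the period is the number of cores the group may use. The sketch below does that division; cpu_limit is a name I made up, and it assumes the cgroup v1 CPU controller files shown in this section:

```shell
# cpu_limit DIR: print the CPU limit (in cores) configured in the
# cgroup v1 cpu controller directory DIR.
cpu_limit() (
  quota=$(cat "$1/cpu.cfs_quota_us")
  period=$(cat "$1/cpu.cfs_period_us")
  if [ "$quota" -lt 0 ]; then
    echo unlimited                 # a quota of -1 means "no limit"
  else
    # Shell arithmetic is integer-only, so let awk do the division.
    LC_ALL=C awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
  fi
)
```

Running cpu_limit /sys/fs/cgroup/cpu/docker/${ID} against the test container should reflect the --cpus='0.5' setting, provided Docker translated it into a quota and a period.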

Examining a Kubernetes pod

To keep the comparison between Docker containers and Kubernetes Pods fair, I use Docker as the container engine in Kubernetes. Now I start a pod for testing:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: app
      image: docker.io/kennethreitz/httpbin
      ports:
        - containerPort: 80
      resources:
        limits:
          memory: "256Mi"
    - name: sidecar
      image: curlimages/curl
      command: ["/bin/sleep", "3650d"]
      resources:
        limits:
          memory: "128Mi"
EOF

The actual pod inspection should be done on the Kubernetes cluster node:

$ ps auxf
USER       PID  ...  COMMAND
...
root      4947         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root      4966             \_ /pause
root      4981         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
root      5001             \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root      5016                 \_ /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
root      5018         \_ containerd-shim -namespace k8s.io -workdir /mnt/sda1/var/lib/containerd/...
100       5035             \_ /bin/sleep 3650d

The above three process groups were created during the pod startup. That’s interesting because in the manifest, only two containers, httpbin and sleep, were requested.

Every Kubernetes Pod includes an empty pause container, which bootstraps the Pod to establish all of the cgroups, reservations, and namespaces before its individual containers are created. The pause container image is always present, so the pod resource allocation happens instantaneously as containers are created.

Display pause containers:

docker ps | grep -i pause

80282e0baa43   mirantis/ucp-pause:3.4.9   "/pause"   3 minutes ago   Up 3 minutes   k8s_POD_foo-wxk5l_default_030fc501-ec75-4675-a742-d19929818065_0

Here is what the namespaces look like on the cluster node:

sudo lsns
        NS TYPE   NPROCS   PID USER            COMMAND
4026532614 net         4  4966 root            /pause
4026532715 mnt         1  4966 root            /pause
4026532716 uts         4  4966 root            /pause
4026532717 ipc         4  4966 root            /pause
4026532718 pid         1  4966 root            /pause
4026532719 mnt         2  5001 root            /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532720 pid         2  5001 root            /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532721 mnt         1  5035 100             /bin/sleep 3650d
4026532722 pid         1  5035 100             /bin/sleep 3650d

So, much like the Docker container in the first section, the pause container gets all five namespaces - net, mnt, uts, ipc, and pid. But apparently, the httpbin and sleep containers get just two namespaces each - mnt and pid.

Using /proc we can get the same information for each container in the pod:

# httpbin container
sudo ls -l /proc/5001/ns
...
lrwxrwxrwx 1 root root 0 Apr 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 root root 0 Apr 24 14:05 mnt -> 'mnt:[4026532719]'
lrwxrwxrwx 1 root root 0 Apr 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 root root 0 Apr 24 14:05 pid -> 'pid:[4026532720]'
lrwxrwxrwx 1 root root 0 Apr 24 14:05 uts -> 'uts:[4026532716]'

# sleep container
sudo ls -l /proc/5035/ns
...
lrwxrwxrwx 1 100 101 0 Apr 24 14:05 ipc -> 'ipc:[4026532717]'
lrwxrwxrwx 1 100 101 0 Apr 24 14:05 mnt -> 'mnt:[4026532721]'
lrwxrwxrwx 1 100 101 0 Apr 24 14:05 net -> 'net:[4026532614]'
lrwxrwxrwx 1 100 101 0 Apr 24 14:05 pid -> 'pid:[4026532722]'
lrwxrwxrwx 1 100 101 0 Apr 24 14:05 uts -> 'uts:[4026532716]'

While it might be tricky to notice, the httpbin and sleep containers actually reuse the net, uts, and ipc namespaces of the pause container!
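
Comparing the link targets by eye is error-prone, so the comparison can be scripted. Here is a rough sketch; shared_ns is a hypothetical helper that takes two /proc/<PID>/ns directories and prints the namespaces whose links point at the same target:

```shell
# shared_ns DIR_A DIR_B: print the names of the namespace links that
# resolve to the same namespace in both /proc/<PID>/ns directories.
shared_ns() (
  for link in "$1"/*; do
    name=${link##*/}
    a=$(readlink "$link") || continue
    b=$(readlink "$2/$name") || continue   # skip names missing in DIR_B
    if [ "$a" = "$b" ]; then
      echo "$name"
    fi
  done
)
```

Called from a root shell as shared_ns /proc/5001/ns /proc/5035/ns, the output should include ipc, net, and uts for the two containers above. Namespaces that both containers simply inherit from the host would be listed as well.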

In the pod spec, the hostIPC, hostNetwork, and hostPID flags make the containers use the corresponding host namespaces instead.
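
As an illustration, the three flags sit at the top level of the pod spec, next to containers. The manifest below is a sketch only: the file name pod-hostns.yaml and the pod name foo-hostns are made up for this example, and the container is the same sleep sidecar used earlier.

```shell
# Write a pod manifest that opts into the host's net, pid and ipc namespaces.
cat > pod-hostns.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: foo-hostns            # hypothetical name
spec:
  hostNetwork: true           # use the node's network namespace
  hostPID: true               # see the node's processes
  hostIPC: true               # use the node's IPC namespace
  containers:
    - name: sidecar
      image: curlimages/curl
      command: ["/bin/sleep", "3650d"]
EOF

# Apply it once the file looks right:
# kubectl apply -f pod-hostns.yaml
```

These flags remove a good part of the isolation described in this post, so many clusters restrict them through admission policies.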

Inspecting pod’s cgroups

Check the cgroups for pods:

$ sudo systemd-cgls
Control group /:
-.slice
├─kubepods
│ ├─burstable
│ │ ├─pod4a8d5c3e-3821-4727-9d20-965febbccfbb
│ │ │ ├─f0e87a93304666766ab139d52f10ff2b8d4a1e6060fc18f74f28e2cb000da8b2
│ │ │ │ └─4966 /pause
│ │ │ ├─dfb1cd29ab750064ae89613cb28963353c3360c2df913995af582aebcc4e85d8
│ │ │ │ ├─5001 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ │ └─5016 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ │ └─097d4fe8a7002d69d6c78899dcf6731d313ce8067ae3f736f252f387582e55ad
│ │ │   └─5035 /bin/sleep 3650d

So, the pod itself gets a parent node, and every container can be tweaked separately as well.
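
To see those per-container settings, the memory limit of every child cgroup under the pod's node can be printed in one go. A minimal sketch, assuming cgroup v1 and a pod directory along the lines of /sys/fs/cgroup/memory/kubepods/burstable/pod<UID> (the exact path depends on the cgroup driver); pod_mem_limits is a name I chose:

```shell
# pod_mem_limits POD_DIR: print "<child-cgroup> <limit-in-bytes>" for every
# container cgroup found directly under the pod's memory cgroup directory.
pod_mem_limits() (
  for f in "$1"/*/memory.limit_in_bytes; do
    [ -r "$f" ] || continue              # no children, or not readable
    child=${f%/memory.limit_in_bytes}    # drop the file name
    printf '%s %s\n' "${child##*/}" "$(cat "$f")"
  done
)
```

With the manifest above, I would expect the app container to report 268435456 (256Mi) and the sidecar 134217728 (128Mi), while the pause container has no limit of its own in the manifest.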

Implementing Pods with Docker

Because a Pod is implemented under the hood as a bunch of containers with shared namespaces and a common cgroup parent, I will try to reproduce this by hand, since Docker itself cannot manage pods.

Docker allows creating a container that reuses an existing network namespace. I’ll use an extra package to simplify dealing with cgroups:

sudo apt-get install cgroup-tools

Firstly, a parent cgroup entry needs to be configured. For the sake of simplicity, I’ll use only the cpu and memory controllers:

sudo cgcreate -g cpu,memory:/pod-foo

# Check if the corresponding folders were created:
ls -l /sys/fs/cgroup/cpu/pod-foo/
ls -l /sys/fs/cgroup/memory/pod-foo/
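
Optionally, the parent cgroup can be given a limit of its own, which is meant to cap all containers placed under it together. The sketch below writes such a limit; set_mem_limit_mib is a hypothetical helper, it assumes cgroup v1 with hierarchical memory accounting enabled, and it has to run in a root shell:

```shell
# set_mem_limit_mib DIR MIB: write a memory limit of MIB mebibytes into the
# cgroup v1 memory controller directory DIR.
set_mem_limit_mib() {
  # memory.limit_in_bytes expects a byte count.
  echo $(( $2 * 1024 * 1024 )) > "$1/memory.limit_in_bytes"
}
```

For example, set_mem_limit_mib /sys/fs/cgroup/memory/pod-foo 384 would allow 384 MiB for the whole group, a value picked here only because it is the sum of the two limits in the Kubernetes manifest.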

Secondly, a sandbox (pause) container should be created:

$ docker run -d --rm \
  --name foo_sandbox \
  --cgroup-parent /pod-foo \
  --ipc 'shareable' \
  alpine sleep infinity

Lastly, start the actual containers, reusing the namespaces of the sandbox container:

# app (httpbin)
$ docker run -d --rm \
  --name app \
  --cgroup-parent /pod-foo \
  --network container:foo_sandbox \
  --ipc container:foo_sandbox \
  kennethreitz/httpbin

# sidecar (sleep)
$ docker run -d --rm \
  --name sidecar \
  --cgroup-parent /pod-foo \
  --network container:foo_sandbox \
  --ipc container:foo_sandbox \
  curlimages/curl sleep 365d

I couldn’t share the uts namespace between containers, because Docker does not allow configuring that: the only alternative to a private uts namespace is the host’s one. But apart from the uts namespace, it’s a success!

The cgroups look much like the ones created by Kubernetes itself:

$ sudo systemd-cgls memory
Controller memory; Control group /:
├─pod-foo
│ ├─488d76cade5422b57ab59116f422d8483d435a8449ceda0c9a1888ea774acac7
│ │ ├─27865 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ │ └─27880 /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
│ ├─9166a87f9a96a954b10ec012104366da9f1f6680387ef423ee197c61d37f39d7
│ │ └─27977 sleep 365d
│ └─c7b0ec46b16b52c5e1c447b77d67d44d16d78f9a3f93eaeb3a86aa95e08e28b6
│   └─27743 sleep infinity

The global list of namespaces also looks familiar:

$ sudo lsns
        NS TYPE   NPROCS   PID USER    COMMAND
...
4026532157 mnt         1 27743 root    sleep infinity
4026532158 uts         1 27743 root    sleep infinity
4026532159 ipc         4 27743 root    sleep infinity
4026532160 pid         1 27743 root    sleep infinity
4026532162 net         4 27743 root    sleep infinity
4026532218 mnt         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532219 uts         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532220 pid         2 27865 root    /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026532221 mnt         1 27977 _apt    sleep 365d
4026532222 uts         1 27977 _apt    sleep 365d
4026532223 pid         1 27977 _apt    sleep 365d

And the httpbin and sidecar containers seem to share the ‘ipc’ and ‘net’ namespaces:

# app container
$ sudo ls -l /proc/27865/ns
lrwxrwxrwx 1 root root 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 mnt -> 'mnt:[4026532218]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 pid -> 'pid:[4026532220]'
lrwxrwxrwx 1 root root 0 Oct 28 07:56 uts -> 'uts:[4026532219]'

# sidecar container
$ sudo ls -l /proc/27977/ns
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 ipc -> 'ipc:[4026532159]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 mnt -> 'mnt:[4026532221]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 net -> 'net:[4026532162]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 pid -> 'pid:[4026532223]'
lrwxrwxrwx 1 _apt systemd-journal 0 Oct 28 07:56 uts -> 'uts:[4026532222]'

podman Pods

Podman is an OCI-compatible container engine and a Docker alternative that can manage pods of containers, much like Kubernetes.

First I start a pod with some containers to test:

$ sudo podman pod create --name my_pod

$ sudo podman pod list

POD ID        NAME    STATUS   CREATED        INFRA ID      # OF CONTAINERS
0e0862e977e1  my_pod     Created  9 seconds ago  19e248401b83  1

$ sudo podman ps -a --pod

CONTAINER ID  IMAGE                 COMMAND  CREATED         STATUS   PORTS   NAMES               POD ID        PODNAME
19e248401b83  k8s.gcr.io/pause:3.5           13 seconds ago  Created          0e0862e977e1-infra  0e0862e977e1  my_pod

$ sudo podman run -dt --pod my_pod docker.io/curlimages/curl sleep 365d

$ sudo podman run -dt --pod my_pod docker.io/kennethreitz/httpbin

$ sudo podman ps -a --pod

CONTAINER ID  IMAGE                                  COMMAND               CREATED             STATUS                 PORTS   NAMES               POD ID        PODNAME
b4f923a0af26  k8s.gcr.io/pause:3.5                                         About a minute ago  Up About a minute ago          b49582203b1a-infra  b49582203b1a  my_pod
51f3c3ea6959  docker.io/curlimages/curl:latest       sleep 365d            About a minute ago  Up About a minute ago          gracious_hellman    b49582203b1a  my_pod
2a5b513a3089  docker.io/kennethreitz/httpbin:latest  gunicorn -b 0.0.0...  3 seconds ago       Up 3 seconds ago               laughing_villani    b49582203b1a  my_pod

Then we can list the used namespaces:

$ sudo lsns
4026534263 net         4 25012 root             /pause
4026534335 mnt         1 25012 root             /pause
4026534336 uts         4 25012 root             /pause
4026534337 ipc         4 25012 root             /pause
4026534338 pid         1 25012 root             /pause
4026534340 mnt         1 25023 systemd-network  sleep 365d
4026534341 pid         1 25023 systemd-network  sleep 365d
4026534342 cgroup      1 25023 systemd-network  sleep 365d
4026534344 mnt         2 30514 root             /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026534345 pid         2 30514 root             /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent
4026534346 cgroup      2 30514 root             /usr/bin/python3 /usr/local/bin/gunicorn -b 0.0.0.0:80 httpbin:app -k gevent

The containers share the ‘ipc’, ‘net’, ‘time’, ‘user’ and ‘uts’ namespaces:

# httpbin container
sudo ls -l /proc/30514/ns
lrwxrwxrwx 1 root root 0 ápr   28 18:54 cgroup -> 'cgroup:[4026534346]'
lrwxrwxrwx 1 root root 0 ápr   28 18:54 ipc -> 'ipc:[4026534337]'
lrwxrwxrwx 1 root root 0 ápr   28 18:54 mnt -> 'mnt:[4026534344]'
lrwxrwxrwx 1 root root 0 ápr   28 18:54 net -> 'net:[4026534263]'
lrwxrwxrwx 1 root root 0 ápr   28 18:54 pid -> 'pid:[4026534345]'
lrwxrwxrwx 1 root root 0 ápr   28 18:54 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 ápr   28 18:54 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 ápr   28 18:54 uts -> 'uts:[4026534336]'


# curl container
sudo ls -l /proc/25023/ns
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 cgroup -> 'cgroup:[4026534342]'
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 ipc -> 'ipc:[4026534337]'
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 mnt -> 'mnt:[4026534340]'
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 net -> 'net:[4026534263]'
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 pid -> 'pid:[4026534341]'
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 time -> 'time:[4026531834]'
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 user -> 'user:[4026531837]'
lrwxrwxrwx 1 systemd-network systemd-journal 0 ápr   28 18:54 uts -> 'uts:[4026534336]'