Kubernetes Swap and etcd Stability: Preventing Control Plane Hangs


Parts of the K8S Security Lab series

Container Runtime Security
Advanced Kernel Security
Network Security
Secure Kubernetes Install
User Security
Image Security
  • Part1: Image security Admission Controller
  • Part2: Image security Admission Controller V2
  • Part3: Image security Admission Controller V3
  • Part4: Continuous Image security
  • Part5: trivy-operator 1.0
  • Part6: trivy-operator 2.1: Trivy-operator is now an Admission controller too!!!
  • Part7: trivy-operator 2.2: Patch release for Admission controller
  • Part8: trivy-operator 2.3: Patch release for Admission controller
  • Part8: trivy-operator 2.4: Patch release for Admission controller
  • Part8: trivy-operator 2.5: Patch release for Admission controller
  • Part9: Image Signature Verification with Connaisseur
  • Part10: Image Signature Verification with Connaisseur 2.0
  • Part11: Image Signature Verification with Kyverno
  • Part12: How to use imagePullSecrets cluster-wide??
  • Part13: Automatically change registry in pod definition
  • Part14: ArgoCD auto image updater
Pod Security
Secret Security
Monitoring and Observability
Backup

    When enabling swap on Kubernetes nodes, you might encounter a critical issue where misbehaving containers don’t get killed automatically—they hang indefinitely. When this affects etcd on control plane nodes, the consequences are severe: the API server continuously tries to communicate with the local etcd instance, generating excessive load and consuming all available resources.

    The Problem: Swap and Container Memory Management

    Kubernetes was designed with the assumption that swap is disabled. When swap is enabled, the kubelet’s ability to enforce memory limits is compromised. Here’s what happens:

    graph TD
        A[Container Memory Pressure] --> B{Swap Enabled?}
        B -->|Yes| C[Memory Swapped to Disk]
        B -->|No| D[Container OOM Killed]
        C --> E[Container Hangs/Freezes]
        D --> F[Container Restarted]
        E --> G[Service Degradation]
        F --> H[Service Recovers]
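
    A quick way to see which branch of this diagram a node is on is to check swap and the pod-level cgroup directly on the node. A minimal sketch, assuming cgroup v2 with the systemd cgroup driver (the kubepods.slice path is the kubelet default in that setup):

    # Is swap active on this node at all?
    swapon --show
    free -h

    # How much swap may pod cgroups use? ("max" = unlimited, 0 = effectively off)
    cat /sys/fs/cgroup/kubepods.slice/memory.swap.max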
    

    Why etcd Suffers Most

    When etcd runs on a control plane node with swap enabled and experiences memory pressure:

    1. etcd doesn’t get killed - Instead of an OOM kill and restart, it swaps to disk
    2. etcd becomes unresponsive - Disk I/O is orders of magnitude slower than RAM
    3. API server keeps retrying - The kube-apiserver constantly attempts to reach the local etcd
    4. Resource exhaustion - Retry loops consume CPU and network resources
    5. Control plane cascade failure - Other components start failing
    sequenceDiagram
        participant API as kube-apiserver
        participant ETCD as etcd (swapped)
        participant KUBELET as kubelet
        participant SCHED as scheduler
        
        ETCD->>ETCD: Memory pressure
        ETCD->>ETCD: Swap to disk (slow)
        API->>ETCD: Request
        ETCD-->>API: Timeout (no response)
        API->>ETCD: Retry request
        ETCD-->>API: Timeout
        loop Retry Loop
            API->>ETCD: Continuous requests
            Note over API,ETCD: CPU spike + Network saturation
        end
        KUBELET->>ETCD: Health check
        KUBELET-->>KUBELET: etcd unhealthy but not dead
        SCHED->>API: Cannot schedule pods
        Note over SCHED: Cluster degraded
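
    One way to confirm this failure mode in practice is to check whether the etcd process is actually sitting in swap and whether it still answers within a deadline. A rough check run on the control plane node, assuming a kubeadm layout with the etcd client certificates under /etc/kubernetes/pki/etcd:

    # How much of the etcd process has been swapped out?
    grep VmSwap /proc/$(pgrep -x etcd)/status

    # Does etcd still answer health checks within 5 seconds?
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
      --command-timeout=5s endpoint health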
    

    Solution 1: Set Resource Limits for Control Plane Components

    The first line of defense is setting explicit resource limits for all control plane components. This prevents any single component from consuming all available resources.

    Post-Installation Configuration

    After cluster installation, add resource limits to the existing static pod manifests:

    # Backup existing manifests
    sudo cp -r /etc/kubernetes/manifests /etc/kubernetes/manifests.backup
    

    kube-apiserver Resources

    Edit /etc/kubernetes/manifests/kube-apiserver.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-apiserver
      namespace: kube-system
    spec:
      containers:
      - name: kube-apiserver
        image: registry.k8s.io/kube-apiserver:v1.29.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        # ... rest of config
    

    kube-controller-manager Resources

    Edit /etc/kubernetes/manifests/kube-controller-manager.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-controller-manager
      namespace: kube-system
    spec:
      containers:
      - name: kube-controller-manager
        image: registry.k8s.io/kube-controller-manager:v1.29.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
    

    kube-scheduler Resources

    Edit /etc/kubernetes/manifests/kube-scheduler.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-scheduler
      namespace: kube-system
    spec:
      containers:
      - name: kube-scheduler
        image: registry.k8s.io/kube-scheduler:v1.29.0
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
    

    etcd Resources

    Edit /etc/kubernetes/manifests/etcd.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: etcd
      namespace: kube-system
    spec:
      containers:
      - name: etcd
        image: registry.k8s.io/etcd:3.5.10-0
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        # Note: etcd has no built-in memory-limit setting; the container limit
        # above is what actually caps its memory usage.
    

    Note: After modifying static pod manifests, the kubelet will automatically restart the pods with new configurations.
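
    To confirm the kubelet picked the new manifests up, you can read the limits back from the mirror pods. A small sketch; the pod name suffix (control-plane-1) is a placeholder for your node name:

    # Watch the control plane pods come back
    kubectl get pods -n kube-system -w

    # Read back the resources applied to the etcd mirror pod
    kubectl get pod etcd-control-plane-1 -n kube-system \
      -o jsonpath='{.spec.containers[0].resources}'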

    Configuration via kubeadm (During Installation)

    For new clusters, you can bake the related settings (etcd quotas, kubelet swap behavior) into the cluster from the start using a kubeadm configuration file. Note that kubeadm's ClusterConfiguration does not expose resource requests and limits for the static pods, so the manifest edits above are still applied afterwards.

    Create kubeadm-config.yaml:

    apiVersion: kubeadm.k8s.io/v1beta3
    kind: ClusterConfiguration
    kubernetesVersion: v1.29.0
    controlPlaneEndpoint: "lb.example.com:6443"
    
    # API Server configuration
    apiServer:
      extraArgs:
        authorization-mode: Node,RBAC
      timeoutForControlPlane: 4m0s
      extraVolumes: []
    
    # Controller Manager configuration
    controllerManager:
      extraArgs:
        bind-address: 0.0.0.0
      extraVolumes: []
    
    # Scheduler configuration
    scheduler:
      extraArgs:
        bind-address: 0.0.0.0
      extraVolumes: []
    
    # etcd configuration
    etcd:
      local:
        dataDir: /var/lib/etcd
        extraArgs:
          quota-backend-bytes: "8589934592"  # 8GB
          max-request-bytes: "10485760"       # 10MB
    
    # Network configuration
    networking:
      dnsDomain: cluster.local
      serviceSubnet: 10.96.0.0/12
      podSubnet: 10.244.0.0/16
    
    ---
    apiVersion: kubeadm.k8s.io/v1beta3
    kind: InitConfiguration
    localAPIEndpoint:
      advertiseAddress: "192.168.1.10"
      bindPort: 6443
    nodeRegistration:
      criSocket: unix:///var/run/containerd/containerd.sock
      imagePullPolicy: IfNotPresent
      name: "control-plane-1"
      taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
    
    ---
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Important: Configure memory management
    failSwapOn: false            # required; without it the kubelet refuses to start when swap is enabled
    memorySwap:
      swapBehavior: LimitedSwap  # Allow limited swap (requires the NodeSwap feature gate and cgroup v2)
    maxPods: 110
    serializeImagePulls: false
    

    Initialize the cluster:

    sudo kubeadm init --config kubeadm-config.yaml
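
    Once the node is up, you can check that the kubelet really applied the swap settings by reading its live configuration. A minimal sketch; it assumes the node is registered under its hostname and that jq is installed (jq only pretty-prints the two fields of interest):

    # Read the running kubelet configuration for this node
    NODE=$(hostname)
    kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" \
      | jq '.kubeletconfig | {failSwapOn, memorySwap}'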
    

    For joining control plane nodes, create kubeadm-join-config.yaml:

    apiVersion: kubeadm.k8s.io/v1beta3
    kind: JoinConfiguration
    controlPlane:
      localAPIEndpoint:
        advertiseAddress: "192.168.1.11"
        bindPort: 6443
    nodeRegistration:
      criSocket: unix:///var/run/containerd/containerd.sock
      imagePullPolicy: IfNotPresent
      name: "control-plane-2"
      taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
    discovery:
      bootstrapToken:
        apiServerEndpoint: "lb.example.com:6443"
        token: "abcdef.0123456789abcdef"
        unsafeSkipCAVerification: false
        # caCertHashes must be supplied when unsafeSkipCAVerification is false, e.g.:
        # caCertHashes: ["sha256:<hash of the cluster CA certificate>"]
    
    sudo kubeadm join --config kubeadm-join-config.yaml
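
    After the remaining control plane nodes have joined, a quick sanity check that the cluster and the etcd membership look right (the pod name is a placeholder, certificate paths assume kubeadm defaults):

    # All control plane nodes should be Ready
    kubectl get nodes -l node-role.kubernetes.io/control-plane -o wide

    # etcd should list one member per control plane node
    kubectl -n kube-system exec etcd-control-plane-1 -- etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
      member list -w table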
    

    Solution 2: Deploy etcd LoadBalancer DaemonSet

    The second solution distributes etcd traffic across all control plane nodes, preventing any single node’s etcd from becoming a bottleneck.

    Architecture Overview

    graph LR
        subgraph "Control Plane Nodes"
            A[kube-apiserver] --> B[Local LB DaemonSet]
            B --> C{Load Balancer}
            C --> D[etcd-1:2379]
            C --> E[etcd-2:2379]
            C --> F[etcd-3:2379]
        end
        
        subgraph "etcd Cluster"
            D
            E
            F
        end
        
        style B fill:#f9f,stroke:#333
        style C fill:#bbf,stroke:#333
    

    Step 1: Create etcd LoadBalancer Service

    Create etcd-lb.yaml:

    apiVersion: v1
    kind: Service
    metadata:
      name: etcd-loadbalancer
      namespace: kube-system
      labels:
        app: etcd-lb
    spec:
      type: ClusterIP
      clusterIP: None  # Headless service
      ports:
      - name: etcd-client
        port: 2379
        targetPort: 2379
        protocol: TCP
      # No selector on purpose: with a selector the endpoints controller would
      # overwrite the manually maintained Endpoints object below.
    ---
    apiVersion: v1
    kind: Endpoints
    metadata:
      name: etcd-loadbalancer
      namespace: kube-system
      labels:
        app: etcd-lb
    subsets:
    - addresses:
      - ip: 192.168.1.10  # control-plane-1
      - ip: 192.168.1.11  # control-plane-2
      - ip: 192.168.1.12  # control-plane-3
      ports:
      - name: etcd-client
        port: 2379
        protocol: TCP
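
    After applying the manifest, the headless service should expose exactly the three control plane addresses:

    kubectl apply -f etcd-lb.yaml
    kubectl get endpoints etcd-loadbalancer -n kube-system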
    

    Step 2: Deploy HAProxy DaemonSet

    Create etcd-haproxy-daemonset.yaml:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: etcd-haproxy
      namespace: kube-system
      labels:
        app: etcd-haproxy
    spec:
      selector:
        matchLabels:
          app: etcd-haproxy
      template:
        metadata:
          labels:
            app: etcd-haproxy
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - operator: Exists
            effect: NoSchedule
          containers:
          - name: haproxy
            image: haproxy:2.8-alpine
            resources:
              requests:
                memory: "64Mi"
                cpu: "50m"
              limits:
                memory: "128Mi"
                cpu: "100m"
            ports:
            - containerPort: 2380
              hostPort: 2380
              name: etcd-lb
            volumeMounts:
            - name: haproxy-config
              mountPath: /usr/local/etc/haproxy
            livenessProbe:
              tcpSocket:
                port: 2380
              initialDelaySeconds: 10
              periodSeconds: 10
            readinessProbe:
              tcpSocket:
                port: 2380
              initialDelaySeconds: 5
              periodSeconds: 5
          volumes:
          - name: haproxy-config
            configMap:
              name: etcd-haproxy-config
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: etcd-haproxy-config
      namespace: kube-system
    data:
      haproxy.cfg: |
        global
          log stdout format raw local0
          maxconn 4096
        
        defaults
          log global
          mode tcp
          timeout connect 5s
          timeout client 30s
          timeout server 30s
          option dontlognull
        
        frontend etcd-frontend
          # NOTE: 2380 is also etcd's default peer port. If etcd already binds it
          # on these hosts, choose a free port here, in the DaemonSet ports/probes,
          # and in the API server --etcd-servers flag.
          bind *:2380
          default_backend etcd-servers

        backend etcd-servers
          balance roundrobin
          # Plain TCP health checks: a kubeadm etcd serves TLS with client
          # certificate authentication on 2379, so an HTTP GET /health check
          # would be rejected. TLS is passed through to etcd unchanged.
          server etcd-1 192.168.1.10:2379 check inter 5s fall 3 rise 2
          server etcd-2 192.168.1.11:2379 check inter 5s fall 3 rise 2
          server etcd-3 192.168.1.12:2379 check inter 5s fall 3 rise 2
    

    Apply the configuration:

    kubectl apply -f etcd-haproxy-daemonset.yaml
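
    Before repointing the API server it is worth checking, on a control plane node, that the local HAProxy is listening and accepts connections. A rough sketch; the port must match whatever you chose for the frontend:

    # One HAProxy pod per control plane node
    kubectl get pods -n kube-system -l app=etcd-haproxy -o wide

    # Run on a control plane node: is the frontend bound and reachable?
    ss -ltn | grep 2380
    nc -vz 127.0.0.1 2380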
    

    Step 3: Configure API Server to Use LoadBalancer

    Modify the kube-apiserver manifest to point to the local HAProxy instance:

    Edit /etc/kubernetes/manifests/kube-apiserver.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-apiserver
      namespace: kube-system
    spec:
      containers:
      - name: kube-apiserver
        command:
        - kube-apiserver
        # Point at the local HAProxy frontend. etcd on a kubeadm cluster serves TLS
        # (passed through by HAProxy in tcp mode), so use https and keep the existing
        # --etcd-cafile, --etcd-certfile and --etcd-keyfile flags.
        - --etcd-servers=https://127.0.0.1:2380
        # ... rest of config
    

    The kubelet will automatically restart the API server with the new configuration.

    Verify the Setup

    # Check HAProxy pods are running
    kubectl get pods -n kube-system -l app=etcd-haproxy
    
    # Verify etcd connectivity through the load balancer (TLS, kubeadm cert paths)
    kubectl exec -n kube-system etcd-control-plane-1 -- \
      etcdctl --endpoints=https://127.0.0.1:2380 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
        --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
        endpoint health
    
    # Check API server logs
    kubectl logs -n kube-system kube-apiserver-control-plane-1 | grep -i etcd
    

    Additional Recommendations

    1. Monitor etcd Performance

    Deploy monitoring for etcd metrics:

    # etcd metrics service
    apiVersion: v1
    kind: Service
    metadata:
      name: etcd-metrics
      namespace: kube-system
    spec:
      type: ClusterIP
      clusterIP: None
      ports:
      - name: metrics
        port: 2381
        targetPort: 2381
      selector:
        component: etcd
    

    Key metrics to watch:

    • etcd_server_has_leader - Leader election status
    • etcd_server_leader_changes_seen_total - Leader change frequency
    • etcd_mvcc_db_total_size_in_bytes - Database size
    • etcd_disk_backend_commit_duration_seconds - Disk commit latency
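
    A quick way to spot-check these metrics without a full monitoring stack is to scrape the local metrics endpoint on a control plane node; the URL assumes the kubeadm default --listen-metrics-urls of http://127.0.0.1:2381:

    # Run on a control plane node
    curl -s http://127.0.0.1:2381/metrics \
      | grep -E 'etcd_server_has_leader|etcd_server_leader_changes_seen_total'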

    2. Consider Disabling Swap

    If possible, the best solution is to disable swap entirely:

    # Disable swap immediately
    sudo swapoff -a
    
    # Remove swap entries from fstab
    # (this drops every line containing "swap"; review /etc/fstab first)
    sudo sed -i '/swap/d' /etc/fstab
    
    # Verify
    free -h  # Swap should show 0
    

    3. Use Dedicated etcd Nodes

    For production clusters, consider running etcd on dedicated nodes separate from the control plane components.
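
    With dedicated (external) etcd nodes, the cluster is pointed at them at install time and you can health-check them the same way. A sketch with placeholder hostnames and the usual kubeadm external-etcd client certificate paths:

    # Check every external etcd member (hostnames are placeholders)
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://etcd-1.example.com:2379,https://etcd-2.example.com:2379,https://etcd-3.example.com:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt \
      --key=/etc/kubernetes/pki/apiserver-etcd-client.key \
      endpoint health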

    Conclusion

    Running Kubernetes with swap enabled introduces significant risks, especially for etcd stability. The combination of:

    1. Resource limits on control plane components
    2. Load balancing etcd traffic across all control plane nodes
    3. Proper monitoring of etcd health metrics

    provides defense-in-depth against control plane failures. However, for production environments, disabling swap remains the recommended approach.

    graph TD
        A[Kubernetes with Swap] --> B{Risk Level}
        B -->|No Protection| C[High Risk: Control Plane Failure]
        B -->|Resource Limits Only| D[Medium Risk: Partial Protection]
        B -->|Limits + LB| E[Lower Risk: Distributed Load]
        B -->|Swap Disabled| F[Best Practice: No Swap Issues]
        
        style C fill:#f96
        style D fill:#ff9
        style E fill:#ff9
        style F fill:#9f9
    

    Remember: A stable etcd is the foundation of a healthy Kubernetes cluster. Invest in proper resource management and monitoring from the start.