Kubernetes Swap and etcd Stability: Preventing Control Plane Hangs
When enabling swap on Kubernetes nodes, you might encounter a critical issue where misbehaving containers don’t get killed automatically. When this affects etcd, the API server ends up generating excessive load and consuming all available resources. This post explains the problem and provides two solutions.
Parts of the K8S Security Lab series
Container Runtime Security
- Part1: How to deploy CRI-O with Firecracker?
- Part2: How to deploy CRI-O with gVisor?
- Part3: How to deploy containerd with Firecracker?
- Part4: How to deploy containerd with gVisor?
- Part5: How to deploy containerd with kata containers?
Advanced Kernel Security
- Part1: Hardening Kubernetes with seccomp
- Part2: Linux user namespace management with CRI-O in Kubernetes
- Part3: Hardening Kubernetes with seccomp
Network Security
- Part1: RKE2 Install With Calico
- Part2: RKE2 Install With Cilium
- Part3: CNI-Genie: network separation with multiple CNI
- Part3: Configure network with nmstate operator
- Part3: Kubernetes Network Policy
- Part4: Kubernetes with external Ingress Controller with vxlan
- Part4: Kubernetes with external Ingress Controller with bgp
- Part4: Central authentication with oauth2-proxy
- Part5: Secure your applications with Pomerium Ingress Controller
- Part6: CrowdSec Intrusion Detection System (IDS) for Kubernetes
- Part7: Kubernetes audit logs and Falco
Secure Kubernetes Install
- Part1: Best Practices to keeping Kubernetes Clusters Secure
- Part2: Kubernetes Secure Install
- Part3: Kubernetes Hardening Guide with CIS 1.6 Benchmark
- Part4: Kubernetes Certificate Rotation
User Security
- Part1: How to create kubeconfig?
- Part2: How to create Users in Kubernetes the right way?
- Part3: Kubernetes Single Sign-on with Pinniped OpenID Connect
- Part4: Kubectl authentication with Kuberos Deprecated !!
- Part5: Kubernetes authentication with Keycloak and gangway Deprecated !!
- Part6: kube-openid-connect 1.0 Deprecated !!
Image Security
Pod Security
- Part1: Using Admission Controllers
- Part2: RKE2 Pod Security Policy
- Part3: Kubernetes Pod Security Admission
- Part4: Kubernetes: How to migrate Pod Security Policy to Pod Security Admission?
- Part5: Pod Security Standards using Kyverno
- Part6: Kubernetes Cluster Policy with Kyverno
Secret Security
- Part1: Kubernetes and Vault integration
- Part2: Kubernetes External Vault integration
- Part3: ArgoCD and kubeseal to encrypt secrets
- Part4: Flux2 and kubeseal to encrypt secrets
- Part5: Flux2 and Mozilla SOPS to encrypt secrets
Monitoring and Observability
- Part6: K8S Logging And Monitoring
- Part7: Install Grafana Loki with Helm3
Backup
When enabling swap on Kubernetes nodes, you might encounter a critical issue where misbehaving containers don’t get killed automatically—they hang indefinitely. When this affects etcd on control plane nodes, the consequences are severe: the API server continuously tries to communicate with the local etcd instance, generating excessive load and consuming all available resources.
The Problem: Swap and Container Memory Management
Kubernetes was designed with the assumption that swap is disabled. When swap is enabled, the kubelet’s ability to enforce memory limits is compromised. Here’s what happens:
graph TD
A[Container Memory Pressure] --> B{Swap Enabled?}
B -->|Yes| C[Memory Swapped to Disk]
B -->|No| D[Container OOM Killed]
C --> E[Container Hangs/Freezes]
D --> F[Container Restarted]
E --> G[Service Degradation]
F --> H[Service Recovers]
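Before digging into the failure mode, it helps to know where a node actually stands. Here is a Linux-only sketch that inspects the node's swap posture with plain /proc reads (no Kubernetes assumptions):

```shell
#!/bin/sh
# Report the node's swap posture from /proc (Linux-only sketch).
swappiness()    { cat /proc/sys/vm/swappiness; }  # kernel's eagerness to swap
swap_total_kb() { awk '/^SwapTotal:/ {print $2}' /proc/meminfo; }
swap_used_kb()  { awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2} END {print t-f}' /proc/meminfo; }

echo "vm.swappiness = $(swappiness)"
echo "SwapTotal     = $(swap_total_kb) kB"
echo "SwapUsed      = $(swap_used_kb) kB"
```

A SwapTotal of 0 means the node still complies with the classic Kubernetes assumption; anything else means the scenarios below can apply.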
Why etcd Suffers Most
When etcd runs on a control plane node with swap enabled and experiences memory pressure:
- etcd doesn’t get killed - Instead of an OOM kill and restart, it swaps to disk
- etcd becomes unresponsive - Disk I/O is orders of magnitude slower than RAM
- API server keeps retrying - The kube-apiserver constantly attempts to reach the local etcd
- Resource exhaustion - Retry loops consume CPU and network resources
- Control plane cascade failure - Other components start failing
sequenceDiagram
participant API as kube-apiserver
participant ETCD as etcd (swapped)
participant KUBELET as kubelet
participant SCHED as scheduler
ETCD->>ETCD: Memory pressure
ETCD->>ETCD: Swap to disk (slow)
API->>ETCD: Request
ETCD-->>API: Timeout (no response)
API->>ETCD: Retry request
ETCD-->>API: Timeout
loop Retry Loop
API->>ETCD: Continuous requests
Note over API,ETCD: CPU spike + Network saturation
end
KUBELET->>ETCD: Health check
KUBELET-->>KUBELET: etcd unhealthy but not dead
SCHED->>API: Cannot schedule pods
Note over SCHED: Cluster degraded
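The “unhealthy but not dead” state is visible from the node itself: a swapped process reports a non-zero VmSwap in /proc. A Linux-only sketch (the etcd PID lookup via pgrep is an assumption about your node layout):

```shell
#!/bin/sh
# Print the swap usage (VmSwap, in kB) of a process (Linux-only sketch).
swap_for() {
  # $1: PID. Falls back to 0 if the kernel omits the VmSwap field.
  awk '/^VmSwap:/ {print $2; found=1} END {if (!found) print 0}' "/proc/$1/status"
}

# On a control plane node you would check etcd, e.g.:
#   swap_for "$(pgrep -o etcd)"
# Demo on the current shell itself:
echo "shell VmSwap: $(swap_for $$) kB"
```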
Solution 1: Set Resource Limits for Control Plane Components
The first line of defense is setting explicit resource limits for all control plane components. This prevents any single component from consuming all available resources.
Post-Installation Configuration
After cluster installation, add resource limits to the static pod manifests:
# Backup existing manifests
sudo cp -r /etc/kubernetes/manifests /etc/kubernetes/manifests.backup
kube-apiserver Resources
Edit /etc/kubernetes/manifests/kube-apiserver.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.29.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
    # ... rest of config
kube-controller-manager Resources
Edit /etc/kubernetes/manifests/kube-controller-manager.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.29.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
kube-scheduler Resources
Edit /etc/kubernetes/manifests/kube-scheduler.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.29.0
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "500m"
etcd Resources
Edit /etc/kubernetes/manifests/etcd.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.10-0
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "2Gi"
        cpu: "2000m"
Note: After modifying static pod manifests, the kubelet will automatically restart the pods with new configurations.
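A crude way to verify you did not miss a manifest is to grep each static pod file for a limits: stanza. A sketch, demonstrated against a throwaway file (on a real node you would loop over /etc/kubernetes/manifests/*.yaml):

```shell
#!/bin/sh
# check_limits FILE -> "ok" if the manifest declares resource limits,
# "missing" otherwise (a crude grep, not a real YAML parser).
check_limits() {
  if grep -q '^ *limits:' "$1"; then echo ok; else echo missing; fi
}

# Demo against a throwaway manifest with no limits set:
f="$(mktemp)"
printf 'spec:\n  containers:\n  - name: etcd\n' > "$f"
check_limits "$f"
rm -f "$f"
```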
Configuration via kubeadm (During Installation)
For new clusters, you can bake the related settings in at initialization time with a kubeadm configuration file. Note that ClusterConfiguration tunes etcd quotas and kubelet behavior, but kubeadm does not itself set resource limits on the static pods; those still have to be applied via the manifests as shown above or via a kubeadm patches directory.
Create kubeadm-config.yaml:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
controlPlaneEndpoint: "lb.example.com:6443"
# API Server configuration
apiServer:
  extraArgs:
    authorization-mode: Node,RBAC
  timeoutForControlPlane: 4m0s
  extraVolumes: []
# Controller Manager configuration
controllerManager:
  extraArgs:
    bind-address: 0.0.0.0
  extraVolumes: []
# Scheduler configuration
scheduler:
  extraArgs:
    bind-address: 0.0.0.0
  extraVolumes: []
# etcd configuration
etcd:
  local:
    dataDir: /var/lib/etcd
    extraArgs:
      quota-backend-bytes: "8589934592" # 8GB
      max-request-bytes: "10485760" # 10MB
# Network configuration
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.244.0.0/16
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "192.168.1.10"
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: "control-plane-1"
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Important: Configure memory management. Running the kubelet with swap
# enabled also requires failSwapOn: false (and the NodeSwap feature gate,
# on by default since it went beta in v1.28).
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap # Allow limited swap
maxPods: 110
serializeImagePulls: false
Initialize the cluster:
sudo kubeadm init --config kubeadm-config.yaml
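Since ClusterConfiguration cannot express resource limits, kubeadm's patches directory is the way to set them at init time. A sketch of a strategic-merge patch that applies the etcd limits from earlier (the file name encodes the target component; the directory path is an example):

```yaml
# /etc/kubernetes/patches/etcd+strategic.yaml
spec:
  containers:
  - name: etcd
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "2Gi"
        cpu: "2000m"
```

Pass it at init time with `sudo kubeadm init --config kubeadm-config.yaml --patches /etc/kubernetes/patches`; the same mechanism is available for `kubeadm join`.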
For joining control plane nodes, create kubeadm-join-config.yaml:
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
controlPlane:
  localAPIEndpoint:
    advertiseAddress: "192.168.1.11"
    bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: "control-plane-2"
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
discovery:
  bootstrapToken:
    apiServerEndpoint: "lb.example.com:6443"
    token: "abcdef.0123456789abcdef"
    unsafeSkipCAVerification: false
    # With unsafeSkipCAVerification: false, the CA pin is mandatory:
    caCertHashes:
    - "sha256:<hash of the cluster CA certificate>"
sudo kubeadm join --config kubeadm-join-config.yaml
Solution 2: Deploy etcd LoadBalancer DaemonSet
The second solution distributes etcd traffic across all control plane nodes, preventing any single node’s etcd from becoming a bottleneck.
Architecture Overview
graph LR
subgraph "Control Plane Nodes"
A[kube-apiserver] --> B[Local LB DaemonSet]
B --> C{Load Balancer}
C --> D[etcd-1:2379]
C --> E[etcd-2:2379]
C --> F[etcd-3:2379]
end
subgraph "etcd Cluster"
D
E
F
end
style B fill:#f9f,stroke:#333
style C fill:#bbf,stroke:#333
Step 1: Create etcd LoadBalancer Service
Create etcd-lb.yaml:
apiVersion: v1
kind: Service
metadata:
  name: etcd-loadbalancer
  namespace: kube-system
  labels:
    app: etcd-lb
spec:
  type: ClusterIP
  clusterIP: None # Headless service
  # No selector: endpoints are managed manually below. (With a selector,
  # the endpoints controller would overwrite the manual Endpoints object.)
  ports:
  - name: etcd-client
    port: 2379
    targetPort: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-loadbalancer
  namespace: kube-system
  labels:
    app: etcd-lb
subsets:
- addresses:
  - ip: 192.168.1.10 # control-plane-1
  - ip: 192.168.1.11 # control-plane-2
  - ip: 192.168.1.12 # control-plane-3
  ports:
  - name: etcd-client
    port: 2379
    protocol: TCP
Step 2: Deploy HAProxy DaemonSet
Create etcd-haproxy-daemonset.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: etcd-haproxy
  namespace: kube-system
  labels:
    app: etcd-haproxy
spec:
  selector:
    matchLabels:
      app: etcd-haproxy
  template:
    metadata:
      labels:
        app: etcd-haproxy
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - operator: Exists
        effect: NoSchedule
      containers:
      - name: haproxy
        image: haproxy:2.8-alpine
        resources:
          requests:
            memory: "64Mi"
            cpu: "50m"
          limits:
            memory: "128Mi"
            cpu: "100m"
        ports:
        # Port 2390 is used because 2380 is already taken by etcd peer
        # traffic on control plane nodes (the pod runs on hostNetwork)
        - containerPort: 2390
          hostPort: 2390
          name: etcd-lb
        volumeMounts:
        - name: haproxy-config
          mountPath: /usr/local/etc/haproxy
        livenessProbe:
          tcpSocket:
            port: 2390
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          tcpSocket:
            port: 2390
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: haproxy-config
        configMap:
          name: etcd-haproxy-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-haproxy-config
  namespace: kube-system
data:
  haproxy.cfg: |
    global
      log stdout format raw local0
      maxconn 4096
    defaults
      log global
      mode tcp
      timeout connect 5s
      timeout client 30s
      timeout server 30s
      option dontlognull
    frontend etcd-frontend
      bind *:2390
      default_backend etcd-servers
    backend etcd-servers
      balance roundrobin
      option httpchk GET /health
      http-check expect status 200
      server etcd-1 192.168.1.10:2379 check inter 5s fall 3 rise 2
      server etcd-2 192.168.1.11:2379 check inter 5s fall 3 rise 2
      server etcd-3 192.168.1.12:2379 check inter 5s fall 3 rise 2
Apply the configuration:
kubectl apply -f etcd-haproxy-daemonset.yaml
Step 3: Configure API Server to Use LoadBalancer
Modify the kube-apiserver manifest to point to the local HAProxy instance:
Edit /etc/kubernetes/manifests/kube-apiserver.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Point to the local HAProxy (2390 avoids clashing with etcd's
    # peer port 2380 on the same host)
    - --etcd-servers=http://127.0.0.1:2390
    # ... rest of config
The kubelet will automatically restart the API server with the new configuration.
Verify the Setup
# Check HAProxy pods are running
kubectl get pods -n kube-system -l app=etcd-haproxy
# Verify etcd connectivity through the local HAProxy (port 2390)
kubectl exec -n kube-system etcd-control-plane-1 -- \
  etcdctl --endpoints=http://127.0.0.1:2390 endpoint health
# Check API server logs
kubectl logs -n kube-system kube-apiserver-control-plane-1 | grep -i etcd
Additional Recommendations
1. Monitor etcd Performance
Deploy monitoring for etcd metrics:
# etcd metrics service
# Note: kubeadm starts etcd with --listen-metrics-urls=http://127.0.0.1:2381,
# so metrics must be re-bound to a reachable address for this to work
apiVersion: v1
kind: Service
metadata:
  name: etcd-metrics
  namespace: kube-system
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: metrics
    port: 2381
    targetPort: 2381
  selector:
    component: etcd
Key metrics to watch:
- etcd_server_has_leader - Leader election status
- etcd_server_leader_changes_seen_total - Leader change frequency
- etcd_mvcc_db_total_size_in_bytes - Database size
- etcd_disk_backend_commit_duration_seconds - Disk commit latency
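If you run the Prometheus Operator, these metrics can back a minimal alert rule. A sketch (the PrometheusRule CRD, the namespace, and the latency threshold are assumptions about your monitoring stack):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-alerts
  namespace: kube-system
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdNoLeader
      expr: etcd_server_has_leader == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "etcd member has no leader"
    - alert: EtcdSlowCommits
      expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "etcd disk commits are slow (possible swapping)"
```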
2. Consider Disabling Swap
If possible, the best solution is to disable swap entirely:
# Disable swap immediately
sudo swapoff -a
# Comment out swap entries in /etc/fstab (safer than deleting the lines)
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab
# Verify
free -h # Swap should show 0
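The steps above can be wrapped in a small guard that runs from a node-provisioning or CI check: count active entries in /proc/swaps (header excluded) and complain if any remain. Linux-only sketch:

```shell
#!/bin/sh
# Fail loudly if any swap device or file is still active (Linux-only).
active_swaps() {
  awk 'NR > 1' /proc/swaps | wc -l
}

if [ "$(active_swaps)" -eq 0 ]; then
  echo "OK: swap is disabled"
else
  echo "WARNING: swap still active:" >&2
  cat /proc/swaps >&2
fi
```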
3. Use Dedicated etcd Nodes
For production clusters, consider running etcd on dedicated nodes separate from the control plane components.
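With kubeadm, pointing the control plane at a dedicated etcd cluster is a matter of using the `etcd.external` stanza instead of `etcd.local`. A sketch, where the endpoints and certificate paths are placeholders for your environment:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  external:
    endpoints:
    - https://192.168.1.20:2379 # dedicated etcd node 1
    - https://192.168.1.21:2379 # dedicated etcd node 2
    - https://192.168.1.22:2379 # dedicated etcd node 3
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

This also sidesteps the stacked-etcd failure mode described above, since the API server and etcd no longer compete for the same node's memory.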
Conclusion
Running Kubernetes with swap enabled introduces significant risks, especially for etcd stability. The combination of:
- Resource limits on control plane components
- Load balancing etcd traffic across all control plane nodes
- Proper monitoring of etcd health metrics
provides defense-in-depth against control plane failures. However, for production environments, disabling swap remains the recommended approach.
graph TD
A[Kubernetes with Swap] --> B{Risk Level}
B -->|No Protection| C[High Risk: Control Plane Failure]
B -->|Resource Limits Only| D[Medium Risk: Partial Protection]
B -->|Limits + LB| E[Lower Risk: Distributed Load]
B -->|Swap Disabled| F[Best Practice: No Swap Issues]
style C fill:#f96
style D fill:#ff9
style E fill:#ff9
style F fill:#9f9
Remember: A stable etcd is the foundation of a healthy Kubernetes cluster. Invest in proper resource management and monitoring from the start.