Kubernetes Swap and etcd Stability: Preventing Control Plane Hangs
When enabling swap on Kubernetes nodes, you might encounter a critical issue where misbehaving containers don’t get killed automatically; instead, they hang indefinitely. When this affects etcd on control plane nodes, the consequences are severe: the API server continuously tries to communicate with the local etcd instance, generating excessive load and consuming all available resources. This post explains the problem and provides two solutions.
The Problem: Swap and Container Memory Management
Kubernetes was designed with the assumption that swap is disabled. When swap is enabled, the kubelet’s ability to enforce memory limits is compromised: instead of being OOM-killed and restarted, a process under memory pressure can be paged out to disk and limp along in a badly degraded state.
Why etcd Suffers Most
When etcd runs on a control plane node with swap enabled and experiences memory pressure:
- etcd doesn’t get killed - Instead of an OOM kill and restart, it swaps to disk
- etcd becomes unresponsive - Disk I/O is orders of magnitude slower than RAM
- API server keeps retrying - The kube-apiserver constantly attempts to reach the local etcd
- Resource exhaustion - Retry loops consume CPU and network resources
- Control plane cascade failure - Other components start failing
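One quick way to confirm this failure mode is to check whether the etcd process itself is sitting in swap. A minimal sketch, assuming a Linux node and a process named etcd (the invocation at the end is illustrative):

```shell
# vmswap_kb <pid>: print the VmSwap value (in kB) for a process,
# or 0 if the kernel reports no swap usage for it.
vmswap_kb() {
  awk '/^VmSwap:/ {print $2; found=1} END {if (!found) print 0}' "/proc/$1/status"
}

# Against a running etcd (illustrative):
#   vmswap_kb "$(pgrep -x etcd | head -n1)"
# Anything well above 0 means etcd memory has been paged out to disk.
```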
Solution 1: Set Resource Limits for Control Plane Components
The first line of defense is setting explicit resource limits for all control plane components. This prevents any single component from consuming all available resources.
Post-Installation Configuration
After cluster installation, create static pod manifests with resource limits:
```shell
# Backup existing manifests
sudo cp -r /etc/kubernetes/manifests /etc/kubernetes/manifests.backup
```
kube-apiserver Resources
Edit /etc/kubernetes/manifests/kube-apiserver.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.29.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
    # ... rest of config
```
kube-controller-manager Resources
Edit /etc/kubernetes/manifests/kube-controller-manager.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.29.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
```
kube-scheduler Resources
Edit /etc/kubernetes/manifests/kube-scheduler.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.29.0
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "500m"
```
etcd Resources
Edit /etc/kubernetes/manifests/etcd.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.10-0
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "2Gi"
        cpu: "2000m"
```
Note: After modifying static pod manifests, the kubelet will automatically restart the pods with new configurations.
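Before relying on the restarted pods, it is worth a quick sanity check that each manifest actually declares limits. A tiny grep-based helper, a sketch rather than a proper YAML parser:

```shell
# has_limits <manifest>: succeed (exit 0) only if the manifest mentions
# both a limits: block and a memory entry. Crude but enough for a spot check.
has_limits() {
  grep -q 'limits:' "$1" && grep -q 'memory:' "$1"
}

# Example (path as used earlier in this post):
#   has_limits /etc/kubernetes/manifests/etcd.yaml && echo "etcd has limits"
```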
Configuration via kubeadm (During Installation)
For new clusters, you can bake this tuning in during initialization with a kubeadm configuration file. Note that ClusterConfiguration has no field for container resources, so the limits themselves are applied via kubeadm’s patches mechanism; the configuration file covers the remaining tuning (etcd quotas, kubelet swap behavior).
Create kubeadm-config.yaml:
```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
controlPlaneEndpoint: "lb.example.com:6443"
# API Server configuration
apiServer:
  extraArgs:
    authorization-mode: Node,RBAC
  timeoutForControlPlane: 4m0s
  extraVolumes: []
# Controller Manager configuration
controllerManager:
  extraArgs:
    bind-address: 0.0.0.0
  extraVolumes: []
# Scheduler configuration
scheduler:
  extraArgs:
    bind-address: 0.0.0.0
  extraVolumes: []
# etcd configuration
etcd:
  local:
    dataDir: /var/lib/etcd
    extraArgs:
      quota-backend-bytes: "8589934592"  # 8GB
      max-request-bytes: "10485760"      # 10MB
# Network configuration
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.244.0.0/16
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "192.168.1.10"
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: "control-plane-1"
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Important: configure memory management for swap-enabled nodes
failSwapOn: false  # required, or the kubelet refuses to start with swap on
memorySwap:
  swapBehavior: LimitedSwap  # allow limited swap (NodeSwap feature)
maxPods: 110
serializeImagePulls: false
```
Initialize the cluster:
```shell
sudo kubeadm init --config kubeadm-config.yaml
```
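Since ClusterConfiguration cannot express container resources, kubeadm’s patches mechanism is the way to apply the limits from Solution 1 at init time. A sketch, assuming a patches directory wired up either through the `patches: directory:` field of InitConfiguration or the `--patches` flag (the path is an example):

```yaml
# /etc/kubernetes/patches/etcd+strategic.yaml
# Strategic-merge patch that kubeadm applies to the generated etcd static pod.
spec:
  containers:
  - name: etcd
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "2Gi"
        cpu: "2000m"
```

Analogous patch files (kube-apiserver+strategic.yaml, and so on) cover the other control plane components.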
For joining control plane nodes, create kubeadm-join-config.yaml:
```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
controlPlane:
  localAPIEndpoint:
    advertiseAddress: "192.168.1.11"
    bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: "control-plane-2"
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
discovery:
  bootstrapToken:
    apiServerEndpoint: "lb.example.com:6443"
    token: "abcdef.0123456789abcdef"
    # With CA verification enabled, kubeadm also requires the CA cert hash
    # (printed by `kubeadm token create --print-join-command`):
    caCertHashes:
    - "sha256:<ca-cert-hash>"
    unsafeSkipCAVerification: false
```

Then join the node:

```shell
sudo kubeadm join --config kubeadm-join-config.yaml
```
Solution 2: Deploy etcd LoadBalancer DaemonSet
The second solution distributes etcd traffic across all control plane nodes, preventing any single node’s etcd from becoming a bottleneck.
Architecture Overview
An HAProxy instance runs on every control plane node (as a DaemonSet pod on the host network) and forwards etcd client traffic to all healthy etcd members. Each kube-apiserver talks to its local HAProxy rather than only its local etcd, so a single slow or swapping etcd member no longer stalls the whole control plane.
Step 1: Create etcd LoadBalancer Service
Create etcd-lb.yaml:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd-loadbalancer
  namespace: kube-system
  labels:
    app: etcd-lb
spec:
  type: ClusterIP
  clusterIP: None  # Headless, selector-less service: endpoints are managed manually below
  ports:
  - name: etcd-client
    port: 2379
    targetPort: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-loadbalancer  # must match the Service name
  namespace: kube-system
  labels:
    app: etcd-lb
subsets:
- addresses:
  - ip: 192.168.1.10  # control-plane-1
  - ip: 192.168.1.11  # control-plane-2
  - ip: 192.168.1.12  # control-plane-3
  ports:
  - name: etcd-client
    port: 2379
    protocol: TCP
```

Note the Service has no selector: when you manage an Endpoints object by hand, the endpoints controller must not own it, and it only leaves selector-less Services alone.
Step 2: Deploy HAProxy DaemonSet
Create etcd-haproxy-daemonset.yaml:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: etcd-haproxy
  namespace: kube-system
  labels:
    app: etcd-haproxy
spec:
  selector:
    matchLabels:
      app: etcd-haproxy
  template:
    metadata:
      labels:
        app: etcd-haproxy
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      tolerations:
      - operator: Exists
        effect: NoSchedule
      containers:
      - name: haproxy
        image: haproxy:2.8-alpine
        resources:
          requests:
            memory: "64Mi"
            cpu: "50m"
          limits:
            memory: "128Mi"
            cpu: "100m"
        ports:
        # 12379 is an arbitrary free port: on the host network, 2379 and
        # 2380 are already taken by the local etcd's client and peer listeners.
        - containerPort: 12379
          hostPort: 12379
          name: etcd-lb
        volumeMounts:
        - name: haproxy-config
          mountPath: /usr/local/etc/haproxy
        livenessProbe:
          tcpSocket:
            port: 12379
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          tcpSocket:
            port: 12379
          initialDelaySeconds: 5
          periodSeconds: 5
      volumes:
      - name: haproxy-config
        configMap:
          name: etcd-haproxy-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-haproxy-config
  namespace: kube-system
data:
  haproxy.cfg: |
    global
      log stdout format raw local0
      maxconn 4096
    defaults
      log global
      mode tcp
      timeout connect 5s
      timeout client 30s
      timeout server 30s
      option dontlognull
    frontend etcd-frontend
      bind *:12379
      default_backend etcd-servers
    backend etcd-servers
      balance roundrobin
      # Plain TCP health checks: etcd serves its client port over TLS, so
      # both the traffic and the API server's client certs pass through untouched.
      server etcd-1 192.168.1.10:2379 check inter 5s fall 3 rise 2
      server etcd-2 192.168.1.11:2379 check inter 5s fall 3 rise 2
      server etcd-3 192.168.1.12:2379 check inter 5s fall 3 rise 2
```
Apply the configuration:
```shell
kubectl apply -f etcd-haproxy-daemonset.yaml
```
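For a quick look at which etcd members a given HAProxy config actually registers, a small illustrative helper (not part of HAProxy itself):

```shell
# backend_servers <cfg>: print the address of every "server" line in an
# HAProxy configuration file.
backend_servers() {
  awk '$1 == "server" {print $3}' "$1"
}

# Usage (hypothetical path):
#   backend_servers /usr/local/etc/haproxy/haproxy.cfg
```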
Step 3: Configure API Server to Use LoadBalancer
Modify the kube-apiserver manifest to point to the local HAProxy instance:
Edit /etc/kubernetes/manifests/kube-apiserver.yaml:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Point at the local HAProxy; TLS is passed through to etcd, so keep
    # the existing --etcd-cafile/--etcd-certfile/--etcd-keyfile flags.
    - --etcd-servers=https://127.0.0.1:12379
    # ... rest of config
```
The kubelet will automatically restart the API server with the new configuration.
Verify the Setup
```shell
# Check HAProxy pods are running
kubectl get pods -n kube-system -l app=etcd-haproxy

# Verify etcd connectivity through the load balancer
# (the etcd static pod runs on the host network, so it reaches HAProxy on localhost)
kubectl exec -n kube-system etcd-control-plane-1 -- etcdctl \
  --endpoints=https://127.0.0.1:12379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Check API server logs
kubectl logs -n kube-system kube-apiserver-control-plane-1 | grep -i etcd
```
Additional Recommendations
1. Monitor etcd Performance
Deploy monitoring for etcd metrics:
```yaml
# etcd metrics service
apiVersion: v1
kind: Service
metadata:
  name: etcd-metrics
  namespace: kube-system
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: metrics
    port: 2381
    targetPort: 2381
  selector:
    component: etcd
```
Key metrics to watch:
- etcd_server_has_leader - Leader election status
- etcd_server_leader_changes_seen_total - Leader change frequency
- etcd_mvcc_db_total_size_in_bytes - Database size
- etcd_disk_backend_commit_duration_seconds - Disk commit latency
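These are exposed in Prometheus text format on etcd's metrics endpoint. A small hedged helper for command-line spot checks (the endpoint in the comment is an assumption; kubeadm binds etcd metrics to 127.0.0.1:2381 on the node by default):

```shell
# metric_value <name>: read Prometheus text-format metrics on stdin and
# print the first sample value for the named metric.
metric_value() {
  awk -v m="$1" '$1 == m {print $2; exit}'
}

# Example spot check (endpoint is an assumption, adjust to your setup):
#   curl -s http://127.0.0.1:2381/metrics | metric_value etcd_server_has_leader
```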
2. Consider Disabling Swap
If possible, the best solution is to disable swap entirely:
```shell
# Disable swap immediately
sudo swapoff -a
# Remove swap from fstab
sudo sed -i '/swap/d' /etc/fstab
# Verify
free -h  # Swap should show 0
```
3. Use Dedicated etcd Nodes
For production clusters, consider running etcd on dedicated nodes separate from the control plane components.
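With kubeadm this means switching ClusterConfiguration from `etcd.local` to `etcd.external`. A sketch with placeholder endpoints and the client certificate paths kubeadm conventionally uses:

```yaml
# ClusterConfiguration fragment for an external etcd cluster
# (endpoints are placeholders; substitute your own hosts)
etcd:
  external:
    endpoints:
    - https://etcd-1.example.com:2379
    - https://etcd-2.example.com:2379
    - https://etcd-3.example.com:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```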
Conclusion
Running Kubernetes with swap enabled introduces significant risks, especially for etcd stability. The combination of:
- Resource limits on control plane components
- Load balancing etcd traffic across all control plane nodes
- Proper monitoring of etcd health metrics
provides defense-in-depth against control plane failures. However, for production environments, disabling swap remains the recommended approach.
Remember: A stable etcd is the foundation of a healthy Kubernetes cluster. Invest in proper resource management and monitoring from the start.