In this post I will show you how to influence where the Kubernetes scheduler places a Pod.
When the Kubernetes scheduler schedules a Pod, it examines each node to determine whether it can host the Pod. The scheduler uses the following equation to calculate the usable memory on a given node:
Usable memory = available memory - reserved memory
The reserved memory refers to:
- Memory used by Kubernetes daemons like kubelet, containerd (or another container runtime).
- Memory used by the node’s operating system, for example kernel daemons.
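The reserved amounts are typically set in the kubelet configuration. The following is a minimal sketch, assuming the kubelet on the node reads a KubeletConfiguration file. The values are illustrative assumptions, not recommendations.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserved for Kubernetes daemons such as the kubelet and the container runtime
kubeReserved:
  cpu: 100m
  memory: 512Mi
# Reserved for the node's operating system daemons
systemReserved:
  cpu: 100m
  memory: 512Mi
# Memory kept free before the kubelet starts evicting Pods
evictionHard:
  memory.available: "200Mi"
```

With reservations like these, the node reports less allocatable memory than its raw capacity, and the scheduler only counts the allocatable part when it decides whether a Pod fits.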
If you follow best practices, you declare the amount of CPU and memory your Pods require through requests and limits. The scheduler places a Pod only on a node whose usable resources can cover the Pod’s requests.
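As a rough sketch of what such a declaration can look like, the Pod below sets requests and limits on its only container. The Pod name and the numbers are hypothetical and chosen only for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-resources   # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:          # used by the scheduler to find a node with enough room
        cpu: 250m
        memory: 128Mi
      limits:            # upper bound enforced on the node at runtime
        cpu: 500m
        memory: 256Mi
```

The scheduler looks at the requests, not the limits, when it checks whether a node has room for the Pod.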
Influencing the Scheduling Process
You have multiple ways to influence the scheduler. The simplest is to force a Pod to run on one, and only one, node by specifying the node’s name in the nodeName field of the Pod spec:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: app-prod01
```
Taints and Tolerations
Suppose you don’t want any Pods to run on a specific node. You might need this for a variety of reasons. Whatever the reason, you need a way to ensure that Pods are not placed on that node. That’s where a taint comes in.
When a node is tainted, no Pod can be scheduled to it unless the Pod tolerates the taint. You can taint a node with a command like this:
```shell
kubectl taint nodes [node name] [key=value]:NoSchedule
kubectl taint nodes worker-01 locked=true:NoSchedule
```
The definition of a Pod that has the toleration needed to be scheduled on the tainted node looks like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
  tolerations:
  - key: "locked"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
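A toleration does not have to match the taint value exactly. As a small sketch, the variant below uses the Exists operator, so the Pod tolerates any NoSchedule taint with the key locked regardless of its value. The Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod-any-value   # hypothetical name
spec:
  containers:
  - name: mycontainer
    image: nginx
  tolerations:
  - key: "locked"
    operator: "Exists"    # no value field: matches locked=<anything>
    effect: "NoSchedule"
```

Keep in mind that a toleration only allows the Pod onto the tainted node. It does not force the Pod to land there.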
Node affinity gives you a more flexible way to choose a node by letting you define hard and soft node requirements. A node must match the hard requirements to be selected, while the soft requirements let you give more weight to nodes with specific labels. The most basic example is choosing a node with an SSD for your database instance:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - zone1
            - zone2
  containers:
  - name: db
    image: mysql
```
requiredDuringSchedulingIgnoredDuringExecution is the hard requirement and preferredDuringSchedulingIgnoredDuringExecution is the soft requirement. You can define affinity not just relative to nodes but also relative to other Pods.
You can use hard (requiredDuringSchedulingIgnoredDuringExecution) and soft (preferredDuringSchedulingIgnoredDuringExecution) requirements for Pods too:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: middleware
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: role
              operator: In
              values:
              - auth
          topologyKey: kubernetes.io/hostname
  containers:
  - name: middleware
    image: redis
```
You may have noticed that both the hard and soft requirements have the IgnoredDuringExecution suffix. It means that after the scheduling decision has been made, the scheduler will not attempt to move already-placed Pods even if the conditions change.
Pod anti-affinity is useful, for example, when your application has multiple replicas and you don’t want two of its Pods scheduled on the same host:
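```yaml
apiVersion: apps/v1        # extensions/v1beta1 Deployments are no longer served since v1.16
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:                # required by apps/v1; must match the template labels
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: nginx
      containers:
      - name: nginx
        image: nginx:latest
```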
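With a hard anti-affinity rule, any replica that cannot get a host of its own stays Pending, which matters when you run more replicas than you have eligible nodes. As a hedged alternative sketch, the same idea can be expressed as a soft rule, so the scheduler tries to separate the replicas but still places them if it cannot. The weight of 100 is an illustrative assumption.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer hosts without another app=nginx Pod
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: nginx
      containers:
      - name: nginx
        image: nginx:latest
```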
Pod Topology Spread Constraints
You can use topology spread constraints to control how Pods are spread across your cluster among failure domains such as regions, zones, and nodes.
Suppose you have a 4-node cluster with the following labels:
```
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,zone=zoneA
node2   Ready    <none>   3m58s   v1.16.0   node=node2,zone=zoneA
node3   Ready    <none>   3m17s   v1.16.0   node=node3,zone=zoneB
node4   Ready    <none>   2m43s   v1.16.0   node=node4,zone=zoneB
```
You can define one or more topologySpreadConstraints entries to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. If you want incoming Pods to be spread evenly with existing Pods across zones, the spec can look something like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: nginx
spec:
  replicas: 8
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: nginx
      containers:
      - name: nginx
        image: nginx:latest
```
Besides the usual Deployment specification, we have additionally defined topologySpreadConstraints as follows:
- maxSkew: 1 - the number of matching Pods in any two zones may differ by at most one, which is the most even distribution possible
- topologyKey: zone - use the zone node label as the topology domain
- whenUnsatisfiable: ScheduleAnyway - always schedule Pods, even if an even distribution can’t be satisfied
- labelSelector - only count Pods that match this selector
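ScheduleAnyway makes the constraint a soft preference. If you need a hard guarantee instead, whenUnsatisfiable also accepts DoNotSchedule. The following is a minimal sketch on a single Pod, assuming the same zone node label and app: nginx Pod label. The Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-strict     # hypothetical name
  labels:
    app: nginx
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    # Hard constraint: leave the Pod Pending rather than exceed maxSkew
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
  containers:
  - name: nginx
    image: nginx:latest
```

The trade-off is that a Pod with DoNotSchedule stays Pending when no placement keeps the skew within maxSkew, for example when one zone has no free capacity.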
You can use two topologySpreadConstraints to control how Pods spread across both zones and nodes:
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: nginx
spec:
  replicas: 8
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: nginx
      - maxSkew: 1
        topologyKey: node
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: nginx
      containers:
      - name: nginx
        image: nginx:latest
```
Of course, you can use the default kubernetes.io/hostname label for