Influencing Kubernetes Scheduler Decisions

In this post I will show you how to influence where the Kubernetes Scheduler places a Pod.

Kubernetes Scheduler

When the Kubernetes Scheduler schedules a Pod, it examines each node to determine whether it can host the Pod. The scheduler uses the following equation to calculate the usable memory on a given node:

Usable memory = available memory - reserved memory

The reserved memory refers to:

  • Memory used by Kubernetes daemons like kubelet, containerd (or another container runtime).
  • Memory used by the node’s operating system, for example kernel daemons.
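
On clusters where you manage the kubelet yourself, these reservations are typically set in the kubelet configuration. A minimal sketch, assuming a KubeletConfiguration file and purely illustrative sizes (the numbers are not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:        # reserved for Kubernetes daemons (kubelet, container runtime)
  memory: "1Gi"
systemReserved:      # reserved for the node's operating system
  memory: "512Mi"
# On a node with 8Gi of memory, this sketch leaves roughly
# 8Gi - 1Gi - 512Mi = 6.5Gi for Pods, before any eviction threshold is subtracted.
```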

If you follow best practices, you declare the amount of CPU and memory your Pods require through requests and limits. The scheduler compares the requests against each node's usable resources to decide whether the Pod fits.
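
As a minimal sketch of such a declaration, the Pod below sets requests and limits on its single container. The amounts are illustrative assumptions, not values from a real workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:          # what the scheduler uses to find a node with enough room
        cpu: "250m"
        memory: "128Mi"
      limits:            # upper bound enforced while the container runs
        cpu: "500m"
        memory: "256Mi"
```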

Influencing the Scheduling Process

You have multiple ways to influence the scheduler. The simplest is to force a Pod to run on one, and only one, node by specifying the node's name in .spec.nodeName.

apiVersion: v1
kind: Pod
metadata:
 name: nginx
spec:
 containers:
 - name: nginx
   image: nginx
 nodeName: app-prod01

Taints and Tolerations

Suppose you don't want any Pods to run on a specific node. Whatever the particular reason, you need a way to ensure Pods are not placed on that node. That’s where a taint comes in.

When a node is tainted with the NoSchedule effect, no new Pod can be scheduled onto it unless the Pod tolerates the taint. You can taint a node with a command like this:

kubectl taint nodes [node name] [key=value]:NoSchedule

kubectl taint nodes worker-01 locked=true:NoSchedule

The definition of a Pod that has the necessary toleration to get scheduled on the tainted node looks like this:

apiVersion: v1
kind: Pod
metadata:
 name: mypod
spec:
 containers:
 - name: mycontainer
   image: nginx
 tolerations:
 - key: "locked"
   operator: "Equal"
   value: "true"
   effect: "NoSchedule"
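
A toleration does not have to match the taint's value exactly. As a minimal sketch, assuming the same locked taint key used above, the Exists operator tolerates the taint whatever its value is (the Pod name mypod-any-value is only illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod-any-value   # hypothetical name, for illustration only
spec:
  containers:
  - name: mycontainer
    image: nginx
  tolerations:
  - key: "locked"
    operator: "Exists"    # no value field: matches any value of the key
    effect: "NoSchedule"
```

Note that a toleration only allows the Pod onto the tainted node; it does not force the Pod to land there.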

Node Affinity

Node Affinity gives you a more flexible way to choose a node by allowing you to define hard and soft node requirements. A node must match the hard requirements to be selected, while the soft requirements let you add more weight to nodes with specific labels. The most basic example of this scenario is choosing a node with an SSD for your database instance:

apiVersion: v1
kind: Pod
metadata:
 name: db
spec:
 affinity:
   nodeAffinity:
     requiredDuringSchedulingIgnoredDuringExecution:
       nodeSelectorTerms:
       - matchExpressions:
         - key: disk-type
           operator: In
           values:
           - ssd
     preferredDuringSchedulingIgnoredDuringExecution:
     - weight: 1
       preference:
         matchExpressions:
         - key: zone
           operator: In
           values:
           - zone1
           - zone2
 containers:
 - name: db
   image: mysql

requiredDuringSchedulingIgnoredDuringExecution is the hard requirement and preferredDuringSchedulingIgnoredDuringExecution is the soft requirement. You can define affinity not only for nodes but also between the Pods themselves.

Pod Affinity

You can use hard (requiredDuringSchedulingIgnoredDuringExecution) and soft (preferredDuringSchedulingIgnoredDuringExecution) requirements for Pods too:

apiVersion: v1
kind: Pod
metadata:
 name: middleware
spec:
 affinity:
   podAffinity:
     requiredDuringSchedulingIgnoredDuringExecution:
     - labelSelector:
         matchExpressions:
         - key: role
           operator: In
           values:
           - frontend
       topologyKey: kubernetes.io/hostname
   podAntiAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
     - weight: 100
       podAffinityTerm:
         labelSelector:
           matchExpressions:
           - key: role
             operator: In
             values:
             - auth
         topologyKey: kubernetes.io/hostname
 containers:
 - name: middleware
   image: redis

You may have noticed that both the hard and soft requirements have the IgnoredDuringExecution suffix. It means that after the scheduling decision has been made, the scheduler will not attempt to move already-placed Pods even if the conditions change.

For example, if your application has multiple replicas and you don't want two Pods of this app scheduled to the same host:

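apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  # apps/v1 requires a selector that matches the template labels
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          # hard rule: never place two app=nginx Pods on the same host
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: nginx
      containers:
      - name: nginx
        image: nginx:latest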

Pod Topology Spread Constraints

You can use topology spread constraints to control how Pods are spread across your cluster among failure domains such as regions, zones, and nodes.

Suppose you have a 4-node cluster with the following labels:

NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,zone=zoneA
node2   Ready    <none>   3m58s   v1.16.0   node=node2,zone=zoneA
node3   Ready    <none>   3m17s   v1.16.0   node=node3,zone=zoneB
node4   Ready    <none>   2m43s   v1.16.0   node=node4,zone=zoneB

You can define one or more topologySpreadConstraints to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. If you want an incoming Pod to be spread evenly with existing Pods across zones, the spec can look something like this:

kind: Pod
apiVersion: v1
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
  containers:
  - name: nginx
    image: nginx:latest

You can combine two topologySpreadConstraints to control how Pods spread across both zones and nodes:

kind: Pod
apiVersion: v1
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
  containers:
  - name: nginx
    image: nginx:latest
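
Both examples use whenUnsatisfiable: DoNotSchedule, which turns the constraint into a hard requirement. As a sketch of the soft counterpart, assuming the same zone label and app: nginx selector as above, ScheduleAnyway tells the scheduler to place the Pod regardless while preferring nodes that reduce the skew:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: ScheduleAnyway   # soft: prefer an even spread, but never block scheduling
    labelSelector:
      matchLabels:
        app: nginx
  containers:
  - name: nginx
    image: nginx:latest
```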