# Influencing Kubernetes Scheduler Decisions


In this post I will show you how you can influence where the Kubernetes Scheduler places a Pod.

### Kubernetes Scheduler

When the Kubernetes Scheduler schedules a Pod, it examines each node to determine whether it can host the Pod. The scheduler uses the following equation to calculate the available memory on a given node:

Usable memory = available memory - reserved memory

The reserved memory refers to:

• Memory used by Kubernetes daemons such as the kubelet and containerd (or another container runtime).
• Memory used by the node’s operating system, for example kernel daemons.
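You can see the result of this reservation on a node by comparing its Capacity and Allocatable sections; `worker-01` below is a placeholder node name:

```shell
# Allocatable = Capacity minus the resources reserved for the operating
# system and Kubernetes daemons; replace worker-01 with one of your nodes.
kubectl describe node worker-01 | grep -A 6 "Allocatable"
```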

If you follow best practices, you declare the amount of CPU and memory your Pods require through requests and limits.
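A minimal sketch of such a declaration, with illustrative request and limit values, might look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:          # what the scheduler uses to place the Pod
        memory: "256Mi"
        cpu: "250m"
      limits:            # hard cap enforced at runtime
        memory: "512Mi"
        cpu: "500m"
```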

### Influencing the Scheduling Process

You have multiple ways to influence the scheduler. The simplest is to force a Pod to run on one - and only one - node by specifying its name in the .spec.nodeName field.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: app-prod01
```


### Taints and Tolerations

Suppose you don’t want any Pods to run on a specific node. You might need this for a variety of reasons. Whatever the particular reason, you need a way to ensure your Pods are not placed on that node. That’s where a taint comes in.

When a node is tainted, no Pod can be scheduled to it unless the Pod tolerates the taint. You can taint a node with a command like this:

```shell
kubectl taint nodes [node name] [key=value]:NoSchedule

kubectl taint nodes worker-01 locked=true:NoSchedule
```
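If you later want to lift the restriction, the same command with a trailing minus removes the taint:

```shell
# The trailing "-" after the effect removes the taint from the node.
kubectl taint nodes worker-01 locked=true:NoSchedule-
```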


The definition for a Pod that has the necessary toleration to get scheduled on the tainted node looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mycontainer
    image: nginx
  tolerations:
  - key: "locked"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```


### Node Affinity

Node Affinity gives you a more flexible way to choose a node by allowing you to define hard and soft node requirements. The hard requirements must be matched by a node for it to be selected, while the soft requirements let you give more weight to nodes with specific labels. The most basic example for this scenario is choosing a node with an SSD for your database instance:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk-type
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - zone1
            - zone2
  containers:
  - name: db
    image: mysql
```


The requiredDuringSchedulingIgnoredDuringExecution rule is the hard requirement, and preferredDuringSchedulingIgnoredDuringExecution is the soft requirement. You can add affinity not just for nodes but for the Pods themselves.

### Pod Affinity

You can use hard (requiredDuringSchedulingIgnoredDuringExecution) and soft (preferredDuringSchedulingIgnoredDuringExecution) requirements for Pods too:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: middleware
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: role
              operator: In
              values:
              - auth
          topologyKey: kubernetes.io/hostname
  containers:
  - name: middleware
    image: redis
```


You may have noticed that both the hard and soft requirements carry the IgnoredDuringExecution suffix. It means that after the scheduling decision has been made, the scheduler will not attempt to move already-placed Pods even if the conditions change.

For example, if your application has multiple replicas and you don’t want two Pods from this app scheduled on the same host:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: nginx
      containers:
      - name: nginx
        image: nginx:latest
```


### Topology Spread Constraints

You can use topology spread constraints to control how Pods are spread across your cluster among failure domains such as regions, zones, and nodes.

Suppose you have a 4-node cluster with the following labels:

```
NAME    STATUS   ROLES    AGE     VERSION   LABELS
node1   Ready    <none>   4m26s   v1.16.0   node=node1,zone=zoneA
node2   Ready    <none>   3m58s   v1.16.0   node=node2,zone=zoneA
node3   Ready    <none>   3m17s   v1.16.0   node=node3,zone=zoneB
node4   Ready    <none>   2m43s   v1.16.0   node=node4,zone=zoneB
```
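Labels like these are attached with `kubectl label`; the node names match the listing above:

```shell
# Label each node with its zone and name so a topologyKey can match them.
kubectl label nodes node1 zone=zoneA node=node1
kubectl label nodes node2 zone=zoneA node=node2
kubectl label nodes node3 zone=zoneB node=node3
kubectl label nodes node4 zone=zoneB node=node4
```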


You can define one or more topologySpreadConstraints entries to instruct the kube-scheduler how to place each incoming Pod in relation to the existing Pods across your cluster. If we want an incoming Pod to be spread evenly with existing Pods across zones, the spec can be something like this:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
  containers:
  - name: nginx
    image: nginx:latest
```


You can use two topologySpreadConstraints to control Pod spreading across both zone and node:

```yaml
kind: Pod
apiVersion: v1
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
  containers:
  - name: nginx
    image: nginx:latest
```