Backup your Kubernetes Cluster

In this post I will show you how you can back up your Kubernetes cluster.

Parts of the K8S Security Lab series

Container Runtime Security
Advanced Kernel Security
Network Security
Secure Kubernetes Install
User Security
Image Security
  • Part1: Image security Admission Controller
  • Part2: Image security Admission Controller V2
  • Part3: Image security Admission Controller V3
  • Part4: Continuous Image security
  • Part5: trivy-operator 1.0
  • Part6: trivy-operator 2.1: Trivy-operator is now an Admission controller too!
  • Part7: trivy-operator 2.2: Patch release for Admission controller
  • Part8: trivy-operator 2.3: Patch release for Admission controller
  • Part8: trivy-operator 2.4: Patch release for Admission controller
  • Part8: trivy-operator 2.5: Patch release for Admission controller
  • Part9: Image Signature Verification with Connaisseur
  • Part10: Image Signature Verification with Connaisseur 2.0
  • Part11: Image Signature Verification with Kyverno
  • Part12: How to use imagePullSecrets cluster-wide?
  • Part13: Automatically change registry in pod definition
  • Part14: ArgoCD auto image updater
Pod Security
Secret Security
Monitoring and Observability
Backup

    Backup Kubernetes objects

    To back up Kubernetes objects I have used Velero (formerly Heptio Ark) for a long time, and I think it is one of the best solutions. Each Velero operation (on-demand backup, scheduled backup, restore) is a custom resource stored in etcd. A backup operation uploads a tarball of the copied Kubernetes objects to cloud object storage, then calls the cloud provider API to take disk snapshots of persistent volumes, if specified. Optionally, you can specify hooks to be executed during the backup. When you create a backup, you can set a TTL by adding the flag --ttl <DURATION>.
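
The hooks and TTL mentioned above can be sketched like this. The pod name, namespace, and freeze path are made-up examples; the annotation keys are Velero's standard backup-hook annotations:

```shell
# Sketch: run a command in the pod before/after backup via Velero hook annotations
# (pod "nginx", namespace "nginx-example", and the fsfreeze path are assumptions)
kubectl -n nginx-example annotate pod nginx \
  pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/log/nginx"]' \
  post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/var/log/nginx"]'

# Create a backup that Velero garbage-collects after 72 hours
velero backup create nginx-backup-72h --ttl 72h0m0s
```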

    Velero supported providers:

    Provider                           Object Store          Volume Snapshotter
    Amazon Web Services (AWS)          AWS S3                AWS EBS
    Google Cloud Platform (GCP)        Google Cloud Storage  Google Compute Engine Disks
    Microsoft Azure                    Azure Blob Storage    Azure Managed Disks
    Portworx                           -                     Portworx Volume
    OpenEBS                            -                     OpenEBS CStor Volume
    VMware vSphere                     -                     vSphere Volumes
    Container Storage Interface (CSI)  -                     CSI Volumes

    Install Velero client

    wget https://github.com/vmware-tanzu/velero/releases/download/v1.5.3/velero-v1.5.3-linux-amd64.tar.gz
    tar zxvf velero-v1.5.3-linux-amd64.tar.gz
    sudo cp velero-v1.5.3-linux-amd64/velero /usr/local/bin
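
To verify the client installation (the reported version will match the downloaded release):

```shell
# Print only the client version, without contacting a cluster
velero version --client-only
```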
    

    Install Velero server component

    First you need to create a secret file that contains the S3 access_key and secret_key. In my case it is called minio.secret.
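
The secret file uses the AWS credentials-file format that Velero's --secret-file flag expects. A minimal sketch, assuming made-up MinIO credentials:

```shell
# Write the credentials file for --secret-file
# (the key values are placeholder assumptions; use your real MinIO keys)
cat << EOF > minio.secret
[default]
aws_access_key_id = minio
aws_secret_access_key = minio123
EOF
```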

    velero install \
     --provider aws \
     --plugins velero/velero-plugin-for-aws:v1.1.0,velero/velero-plugin-for-csi:v0.1.2  \
     --bucket bucket  \
     --secret-file minio.secret  \
     --use-volume-snapshots=true \
     --backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.mydomain.intra  \
     --snapshot-location-config region=default \
     --features=EnableCSI
    

    We need to label the VolumeSnapshotClass so that Velero uses it to create snapshots.

    kubectl label VolumeSnapshotClass csi-rbdplugin-snapclass \
    velero.io/csi-volumesnapshot-class=true
    
    kubectl label VolumeSnapshotClass csi-cephfsplugin-snapclass \
    velero.io/csi-volumesnapshot-class=true
    

    Create Backup

    velero backup create nginx-backup \
    --include-namespaces nginx-example --wait
    
    velero backup describe nginx-backup
    velero backup logs nginx-backup
    velero backup get
    
    velero schedule create nginx-daily --schedule="0 1 * * *" \
    --include-namespaces nginx-example
    
    velero schedule get
    velero backup get
    

    Automate backup schedules with Kyverno

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: autobackup-policy
    spec:
      background: false
      rules:
      - name: "add-velero-autobackup-policy"
        match:
            resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  nirmata.io/auto-backup: enabled
        generate:
            kind: Schedule
            name: "{{request.object.metadata.name}}-auto-schedule"
            namespace: velero
            apiVersion: velero.io/v1
            synchronize: true
            data:
              metadata:
                labels:
                  nirmata.io/backup.type: auto
                  nirmata.io/namespace: '{{request.object.metadata.name}}'
              spec:
                schedule: 0 1 * * *
                template:
                  includedNamespaces:
                    - "{{request.object.metadata.name}}"
                  snapshotVolumes: false
                  storageLocation: default
                  ttl: 168h0m0s
                  volumeSnapshotLocations:
                    - default
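
With this policy in place, labeling a namespace is all it takes to get a Velero Schedule generated for it (the namespace name here is just the example from earlier):

```shell
# Opt a namespace into automatic backups; Kyverno then generates a
# Schedule named <namespace>-auto-schedule in the velero namespace
kubectl label namespace nginx-example nirmata.io/auto-backup=enabled

# Verify that the schedule was generated
velero schedule get
```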
    

    Restore test

    kubectl delete ns nginx-example
    
    velero restore create nginx-restore-test --from-backup nginx-backup
    velero restore get
    
    kubectl get po -n nginx-example
    

    Backup etcd database

    Etcd Backup with RKE2

    With RKE2, snapshotting of the etcd database is automatically enabled. You can configure the snapshot interval in the RKE2 config like this:

    mkdir -p /etc/rancher/rke2
    cat << EOF >  /etc/rancher/rke2/config.yaml
    write-kubeconfig-mode: "0644"
    profile: "cis-1.5"
    # Take an etcd snapshot every 6 hours
    etcd-snapshot-schedule-cron: "0 */6 * * *"
    # Keep 56 etcd snapshots (2 weeks at 4 snapshots a day)
    etcd-snapshot-retention: 56
    EOF
    

    The snapshot directory defaults to /var/lib/rancher/rke2/server/db/snapshots
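
Besides the cron-based snapshots, newer RKE2 releases can also take an on-demand snapshot from the CLI (the snapshot name is an example):

```shell
# Take an on-demand etcd snapshot; it lands in the snapshot directory above
rke2 etcd-snapshot save --name before-upgrade

# List the snapshot files
ls /var/lib/rancher/rke2/server/db/snapshots
```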

    Restoring RKE2 Cluster from a Snapshot

    To restore the cluster from a snapshot, run RKE2 with the --cluster-reset option and with --cluster-reset-restore-path pointing at the snapshot:

    systemctl stop rke2-server
    rke2 server \
      --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<SNAPSHOT-NAME>
    

    Result: a message in the logs says that RKE2 can be restarted without the flags. Start RKE2 again; it should run successfully and be restored from the specified snapshot.

    When rke2 resets the cluster, it creates a file at /var/lib/rancher/rke2/server/db/etc/reset-file. If you want to reset the cluster again, you will need to delete this file.

    Backup ETCD with kanister

    Kanister is another backup tool for Kubernetes, created by Kasten (part of Veeam).

    Installing Kanister

    helm repo add kanister https://charts.kanister.io/
    helm install kanister kanister/kanister-operator --namespace kanister --create-namespace --set image.tag=0.50.0
    

    Before taking a backup of the etcd cluster, a Secret needs to be created, containing details about the authentication mechanism used by etcd and another for the S3 bucket. In the case of kubeadm, it is likely that etcd will have been deployed using TLS-based authentication.

    kanctl create profile s3compliant --access-key <aws-access-key> \
            --secret-key <aws-secret-key> \
            --bucket <bucket-name> --region <region-name> \
            --namespace kanister
    
    kubectl create secret generic etcd-details \
         --from-literal=cacert=/etc/kubernetes/pki/etcd/ca.crt \
         --from-literal=cert=/etc/kubernetes/pki/etcd/server.crt \
         --from-literal=endpoints=https://127.0.0.1:2379 \
         --from-literal=key=/etc/kubernetes/pki/etcd/server.key \
         --from-literal=etcdns=kube-system \
         --from-literal=labels=component=etcd,tier=control-plane \
         --namespace kanister
    
    kubectl label secret -n kanister etcd-details include=true
    kubectl annotate secret -n kanister etcd-details kanister.kasten.io/blueprint='etcd-blueprint'
    

    Kanister uses a CRD called Blueprint to define the backup sequence. There is an example Blueprint for etcd backup:

    kubectl --namespace kanister apply -f \
        https://raw.githubusercontent.com/kanisterio/kanister/0.50.0/examples/etcd/etcd-in-cluster/k8s/etcd-incluster-blueprint.yaml
    

    Now we can create a backup by creating a custom resource called ActionSet:

    kubectl create -n kanister -f - << EOF
    apiVersion: cr.kanister.io/v1alpha1
    kind: ActionSet
    metadata:
      creationTimestamp: null
      generateName: backup-
      namespace: kanister
    spec:
      actions:
      - blueprint: "<blueprint-name>"
        configMaps: {}
        name: backup
        object:
          apiVersion: v1
          group: ""
          kind: ""
          name: "<secret-name>"
          namespace: "<secret-namespace>"
          resource: secrets
        options: {}
        preferredVersion: ""
        profile:
          apiVersion: ""
          group: ""
          kind: ""
          name: "<profile-name>"
          namespace: kanister
          resource: ""
        secrets: {}
    EOF
    
    kubectl get actionsets
    kubectl describe actionsets -n kanister backup-hnp95
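
For Blueprints that define a restore action, kanctl can also generate the restore ActionSet from a completed backup (a sketch; the ActionSet name is the example one from above). For etcd, the actual data restore is still done manually, as described in the next section:

```shell
# Create a restore ActionSet referencing the completed backup ActionSet
kanctl --namespace kanister create actionset --action restore --from backup-hnp95

# Watch its progress
kubectl describe actionsets -n kanister
```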
    

    Restore the ETCD cluster

    SSH into the node where etcd is running; usually this is a Kubernetes master node.

    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --data-dir="/var/lib/etcd-from-backup" \
      --initial-cluster="ubuntu-s-4vcpu-8gb-blr1-01-master-1=https://127.0.0.1:2380" \
      --name="ubuntu-s-4vcpu-8gb-blr1-01-master-1" \
      --initial-advertise-peer-urls="https://127.0.0.1:2380" \
      --initial-cluster-token="etcd-cluster-1" \
      snapshot restore /tmp/etcd-backup.db
    

    Now we have to instruct the running etcd to use this new directory instead of its default one. To do that, open the static pod manifest for etcd, /etc/kubernetes/manifests/etcd.yaml, and:

    • change the data-dir in the etcd container's command to /var/lib/etcd-from-backup
    • add another argument to the command, --initial-cluster-token=etcd-cluster-1, as seen in the restore command
    • change the volume (named etcd-data) to the new directory /var/lib/etcd-from-backup
    • change the volume mount (named etcd-data) to the new directory /var/lib/etcd-from-backup

    Once you save this manifest, a new etcd pod will be created with the new data directory. Please wait for the etcd pod to be up and running.
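
The edits above amount to something like the following fragments of /etc/kubernetes/manifests/etcd.yaml (only the changed fields are shown; everything else stays as kubeadm generated it):

```yaml
spec:
  containers:
  - command:
    - etcd
    - --data-dir=/var/lib/etcd-from-backup        # was /var/lib/etcd
    - --initial-cluster-token=etcd-cluster-1      # added, matches the restore command
    # ...remaining flags unchanged...
    name: etcd
    volumeMounts:
    - mountPath: /var/lib/etcd-from-backup        # was /var/lib/etcd
      name: etcd-data
  volumes:
  - hostPath:
      path: /var/lib/etcd-from-backup             # was /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
```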

    Restoring an ETCD snapshot in case of a multi-node ETCD cluster

    If your Kubernetes cluster is set up with more than one etcd member up and running, you will have to follow almost the same steps we have already seen, with some minor changes. You have one snapshot file from the backup, and, as the etcd documentation says, all members should restore from the same snapshot. Choose one node on which to restore the backup first, and stop the static pods on all the other member nodes. To stop the static pods on those nodes, move the static pod manifests out of the static pod path, which for kubeadm is /etc/kubernetes/manifests. Once you are sure that the etcd containers on the other nodes have stopped, follow the steps from the previous section (Restore the ETCD cluster) on each member node sequentially.

    Take another look at the command below that we actually run to restore the snapshot:

    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --data-dir="/var/lib/etcd-from-backup" \
      --initial-cluster="ubuntu-s-4vcpu-8gb-blr1-01-master-1=https://127.0.0.1:2380" \
      --name="ubuntu-s-4vcpu-8gb-blr1-01-master-1" \
      --initial-advertise-peer-urls="https://127.0.0.1:2380" \
      --initial-cluster-token="etcd-cluster-1" \
      snapshot restore /tmp/etcd-backup.db
    

    Make sure to change the node name in the --initial-cluster and --name flags, because it depends on which member node you are running the command on. We do not change the value of --initial-cluster-token, because etcdctl snapshot restore creates a new member, and we want all the new members to have the same token so that they belong to one cluster and do not accidentally join any other one.

    To explore more about this, see the Kubernetes documentation.