Manage long-running workloads

Move pods off a cordoned node
  1. For long-running workloads governed by a Kubernetes deployment, use the following command to move the pods off the cordoned node:

    $ kubectl rollout restart deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute

    The name of the deployment is the same as the first part of the name of the pod in the previous section.

  2. To see a list of all deployments in the compute namespace, run:

    $ kubectl get deploy -n domino-compute

    Whether the associated app or model experiences any downtime depends on the update strategy of the deployment. For the example workloads described previously, one App and one Domino endpoint in a test deployment, the describe output (filtered for brevity) looks like this:

    $ kubectl describe deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute | grep -iE "strategy|replicas:"
    Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
    StrategyType:           RollingUpdate
    RollingUpdateStrategy:  1 max unavailable, 1 max surge
    
    $ kubectl describe deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute | grep -iE "strategy|replicas:"
    Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
    StrategyType:           RollingUpdate
    RollingUpdateStrategy:  0 max unavailable, 25% max surge

    This App would experience some downtime, because its single pod is terminated immediately (1 max unavailable, with only 1 replica running). The model will not experience downtime, because the old pod cannot be terminated until a new pod is available (0 max unavailable). You can edit the deployments to change these settings and avoid downtime.
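For instance, to make the App behave like the model, you could edit its deployment (`kubectl edit deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute`) so that the strategy section of the spec looks like this sketch:

```yaml
# With maxUnavailable: 0, Kubernetes must bring up a replacement pod and
# wait for it to become ready before terminating the old one, so even a
# single-replica deployment never drops to zero pods during a restart.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1
```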

Manage older versions of Kubernetes

Earlier versions of Kubernetes do not have the kubectl rollout restart command, but you can achieve a similar effect by patching the deployment with a throwaway annotation like this:

$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"template":{"metadata":{"annotations":{"migration_date":"'$(date +%Y%m%d)'"}}}}}'

The patching process respects the same update strategies as the previously mentioned restart command.
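A safer way to use this pattern is to build the patch body in a variable first, so you can inspect the JSON before applying it. This sketch reuses the deployment name from the example above:

```shell
# Build the JSON patch body separately so it can be inspected before use.
# The annotation value is just today's date; any change to the pod template
# triggers a new rollout, so the exact content of the annotation is irrelevant.
patch='{"spec":{"template":{"metadata":{"annotations":{"migration_date":"'$(date +%Y%m%d)'"}}}}}'
echo "$patch"
# Then apply it:
# kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p "$patch"
```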

Sample commands to retire several nodes

If you have to retire several nodes, you might want to loop over many nodes and/or workload pods in a single command. To do this, you can customize the output format of kubectl commands, filter them, and combine them with xargs.

When constructing commands for larger maintenance, always run the first part of the command by itself to verify that the list of names passed to xargs, and on to the final kubectl command, is what you expect.
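One way to do that verification, sketched here with stand-in node names, is to prefix the final command with echo so that xargs prints what it would run instead of running it:

```shell
# Dry run: xargs invokes "echo kubectl cordon <name>" for each input line,
# printing the commands that would be executed.
# node-a and node-b are stand-ins for the output of a real
# "kubectl get nodes ... --no-headers" listing.
printf 'node-a\nnode-b\n' | xargs -n1 echo kubectl cordon
```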

Cordon all nodes in the default node pool
$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=:.metadata.name --no-headers | xargs kubectl cordon
Filter labels to view apps running on a particular node
$ kubectl get pods -n domino-compute -o wide -l dominodatalab.com/workload-type=App | grep <node-name>
Do a rolling restart of all model pods (over all nodes)
$ kubectl get deploy -n domino-compute -o custom-columns=:.metadata.name --no-headers | grep model | xargs kubectl rollout restart -n domino-compute deploy