There might be times when you have to remove a specific node (or multiple nodes) from service, either temporarily or permanently. This might include cases of troubleshooting nodes that are in a bad state, or retiring nodes after an update to the AMI so that all nodes are using the new AMI.
This topic describes how to temporarily prevent new workloads from being assigned to a node, as well as how to safely remove workloads from a node so that it can be permanently retired.
The kubectl cordon <node> command will prevent any additional pods from being scheduled onto the node, without disrupting any of the pods currently running on it. For example, let’s say a new node in your cluster has come up with some problems, and you want to cordon it before launching any new runs to ensure they will not land on that node. The procedure might look like this:
$ kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-0-221.us-east-2.compute.internal   Ready    <none>   12d   v1.14.7-eks-1861c5
ip-192-168-17-8.us-east-2.compute.internal    Ready    <none>   12d   v1.14.7-eks-1861c5
ip-192-168-24-46.us-east-2.compute.internal   Ready    <none>   51m   v1.14.7-eks-1861c5
ip-192-168-3-110.us-east-2.compute.internal   Ready    <none>   12d   v1.14.7-eks-1861c5
$ kubectl cordon ip-192-168-24-46.us-east-2.compute.internal
node/ip-192-168-24-46.us-east-2.compute.internal cordoned
$ kubectl get no
NAME                                          STATUS                     ROLES    AGE   VERSION
ip-192-168-0-221.us-east-2.compute.internal   Ready                      <none>   12d   v1.14.7-eks-1861c5
ip-192-168-17-8.us-east-2.compute.internal    Ready                      <none>   12d   v1.14.7-eks-1861c5
ip-192-168-24-46.us-east-2.compute.internal   Ready,SchedulingDisabled   <none>   53m   v1.14.7-eks-1861c5
ip-192-168-3-110.us-east-2.compute.internal   Ready                      <none>   12d   v1.14.7-eks-1861c5
Notice the SchedulingDisabled status on the cordoned node. You can undo this and return the node to service with the command kubectl uncordon <node>.
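For example, to return the node cordoned above to service (the confirmation message shown is representative):
$ kubectl uncordon ip-192-168-24-46.us-east-2.compute.internal
node/ip-192-168-24-46.us-east-2.compute.internal uncordoned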
Identify user workloads
Before removing a node from service permanently, you must ensure there are no workloads still running on it that should not be disrupted. For example, you might see the following workloads running on a node (notice the compute namespace specified with -n and the wide output format, which includes the node hosting each pod, specified with -o wide):
$ kubectl get po -n domino-compute -o wide | grep ip-192-168-24-46.us-east-2.compute.internal
run-5e66acf26437fe0008ca1a88-f95mk                2/2   Running   0   23m   192.168.4.206    ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
run-5e66ad066437fe0008ca1a8f-629p9                3/3   Running   0   24m   192.168.28.87    ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7     3/3   Running   0   51m   192.168.23.128   ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9   2/2   Running   0   54m   192.168.28.1     ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
domino-build-5e67c9299c330f0008f70ad1             1/1   Running   0   3s    192.168.13.131   ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
Different types of workloads must be treated differently. You can see the details of a particular workload with kubectl describe po run-5e66acf26437fe0008ca1a88-f95mk -n domino-compute. The labels section of the describe output is particularly useful for distinguishing the type of workload, as each of the workloads named run-… will have a label like dominodatalab.com/workload-type=<type of workload>. The previous example contains one of each of the major user workloads:
- run-5e66acf26437fe0008ca1a88-f95mk is a Batch Job, with label dominodatalab.com/workload-type=Batch. It will stop running on its own once it is finished and disappear from the list of active workloads.
- run-5e66ad066437fe0008ca1a8f-629p9 is a Workspace, with label dominodatalab.com/workload-type=Workspace. It will keep running until the user who launched it shuts it down. You have the option of contacting users to shut down their workspaces, waiting a day or two in the expectation that they will shut them down naturally, or removing the node with the workspaces still running. (The last option is not recommended unless you are certain there is no un-synced work in any of the workspaces and have communicated with the users about the interruption.)
- run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7 is an App, with label dominodatalab.com/workload-type=App. It is a long-running process governed by a Kubernetes deployment. It will be recreated automatically if you destroy the node hosting it, but will experience whatever downtime is required for a new pod to be created and scheduled on another node. See below for methods to proactively move the pod and reduce downtime.
- model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9 is a Model API. It does not have a dominodatalab.com/workload-type label, and is instead easily identifiable by the pod name. It is also a long-running process, similar to an App, with similar concerns. See below for methods to proactively move the pod and reduce downtime.
- domino-build-5e67c9299c330f0008f70ad1 is a Compute Environment build. It will finish on its own and go into a Completed state.
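As a quicker alternative to describing each pod, one option is to add the workload-type label as a column in the pod listing with the -L flag of kubectl get (a sketch; output will vary with the workloads on your node):
$ kubectl get po -n domino-compute -o wide -L dominodatalab.com/workload-type | grep ip-192-168-24-46.us-east-2.compute.internal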
Manage long-running workloads
For the long-running workloads governed by a Kubernetes deployment, you can proactively move the pods off of the cordoned node by running a command like this:
$ kubectl rollout restart deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute
Notice the name of the deployment is the same as the first part of the name of the pod in the above section. You can see a list of all deployments in the compute namespace by running kubectl get deploy -n domino-compute.
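Once the rollout finishes, one way to confirm that the replacement pod landed on a different node is to repeat the wide pod listing used earlier, filtered to the deployment in question (names here are the ones from the example above):
$ kubectl get po -n domino-compute -o wide | grep model-5e66ad4a9c330f0008f709e4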
Whether the associated App or Model API experiences any downtime depends on the update strategy of the deployment. For the two long-running example workloads above, one App and one Model API, a test deployment shows the following (describe output filtered here for brevity):
$ kubectl describe deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute | grep -i "strategy\|replicas:"
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
RollingUpdateStrategy: 1 max unavailable, 1 max surge
$ kubectl describe deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute | grep -i "strategy\|replicas:"
Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType: RollingUpdate
RollingUpdateStrategy: 0 max unavailable, 25% max surge
The App in this case would experience some downtime, since the old pod will be terminated immediately (1 max unavailable with only 1 pod currently running). The Model API will not experience any downtime, since the termination of the old pod is forced to wait until a new pod is available (0 max unavailable). If desired, you can edit the deployments to change these settings and avoid downtime.
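For example, a minimal sketch of patching the App deployment above so that a rollout never takes its only pod down before a replacement is available (the values are illustrative; pick settings appropriate for your environment):
$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'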
Manage older versions of Kubernetes
Earlier versions of Kubernetes do not have the kubectl rollout restart command, but a similar effect can be achieved by "patching" the deployment with a throwaway annotation like this:
$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"template":{"metadata":{"annotations":{"migration_date":"'$(date +%Y%m%d)'"}}}}}'
The patching process will respect the same update strategies as the above restart command.
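In either case, you can watch the progress of the rollout with kubectl rollout status, shown here for the model deployment from the earlier example:
$ kubectl rollout status deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute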
In cases where you have to retire many nodes, it can be useful to loop over many nodes and/or workload pods in a single command. Customizing the output format of kubectl commands, filtering appropriately, and combining with xargs makes this possible.
For example, to cordon all nodes in the default node pool, you can run the following:
$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=:.metadata.name --no-headers | xargs kubectl cordon
To view only apps running on a particular node, you can filter using the labels discussed previously:
$ kubectl get pods -n domino-compute -o wide -l dominodatalab.com/workload-type=App | grep <node-name>
To do a rolling restart of all model pods (over all nodes), you can run:
$ kubectl get deploy -n domino-compute -o custom-columns=:.metadata.name --no-headers | grep model | xargs kubectl rollout restart -n domino-compute deploy
When constructing such commands for larger maintenance, always run the first part of the command by itself to verify that the list of names being passed to xargs and on to the final kubectl command is what you expect.
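For example, before running the bulk cordon command above, you might first run its first part on its own and review the node names it returns:
$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=:.metadata.name --no-headers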