There might be times when you have to remove a specific node (or multiple nodes) from service, either temporarily or permanently. This might include cases of troubleshooting nodes that are in a bad state, or retiring nodes after an update to the AMI so that all nodes are using the new AMI.
This topic describes how to temporarily prevent new workloads from being assigned to a node, as well as how to safely remove workloads from a node so that it can be permanently retired.
The kubectl cordon <node> command will prevent any additional pods from being scheduled onto the node, without disrupting any of the pods currently running on it. For example, let’s say a new node in your cluster has come up with some problems, and you want to cordon it before launching any new runs to ensure they will not land on that node. The procedure might look like this:
$ kubectl get nodes
NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-0-221.us-east-2.compute.internal   Ready    <none>   12d   v1.14.7-eks-1861c5
ip-192-168-17-8.us-east-2.compute.internal    Ready    <none>   12d   v1.14.7-eks-1861c5
ip-192-168-24-46.us-east-2.compute.internal   Ready    <none>   51m   v1.14.7-eks-1861c5
ip-192-168-3-110.us-east-2.compute.internal   Ready    <none>   12d   v1.14.7-eks-1861c5
$ kubectl cordon ip-192-168-24-46.us-east-2.compute.internal
node/ip-192-168-24-46.us-east-2.compute.internal cordoned
$ kubectl get no
NAME                                          STATUS                     ROLES    AGE   VERSION
ip-192-168-0-221.us-east-2.compute.internal   Ready                      <none>   12d   v1.14.7-eks-1861c5
ip-192-168-17-8.us-east-2.compute.internal    Ready                      <none>   12d   v1.14.7-eks-1861c5
ip-192-168-24-46.us-east-2.compute.internal   Ready,SchedulingDisabled   <none>   53m   v1.14.7-eks-1861c5
ip-192-168-3-110.us-east-2.compute.internal   Ready                      <none>   12d   v1.14.7-eks-1861c5
Notice the SchedulingDisabled status on the cordoned node. You can undo this and return the node to service with the command kubectl uncordon <node>.
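For example, to return the node cordoned above to service (the confirmation message shown is representative):
$ kubectl uncordon ip-192-168-24-46.us-east-2.compute.internal
node/ip-192-168-24-46.us-east-2.compute.internal uncordoned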
Identify user workloads
Before removing a node from service permanently, you must ensure there are no workloads still running on it that should not be disrupted. For example, you might see the following workloads running on a node (notice the compute namespace specified with -n and the wide output format, which includes the node hosting each pod, specified with -o wide):
$ kubectl get po -n domino-compute -o wide | grep ip-192-168-24-46.us-east-2.compute.internal
run-5e66acf26437fe0008ca1a88-f95mk                2/2   Running   0   23m   192.168.4.206    ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
run-5e66ad066437fe0008ca1a8f-629p9                3/3   Running   0   24m   192.168.28.87    ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7     3/3   Running   0   51m   192.168.23.128   ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9   2/2   Running   0   54m   192.168.28.1     ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
domino-build-5e67c9299c330f0008f70ad1             1/1   Running   0   3s    192.168.13.131   ip-192-168-24-46.us-east-2.compute.internal   <none>   <none>
Different types of workloads must be treated differently. You can see the details of a particular workload with kubectl describe po run-5e66acf26437fe0008ca1a88-f95mk -n domino-compute. The labels section of the describe output is particularly useful for distinguishing the type of workload, as each of the workloads named run-… will have a label like dominodatalab.com/workload-type=<type of workload>. The previous example contains one of each of the major user workloads:
- run-5e66acf26437fe0008ca1a88-f95mk is a Batch Job, with label dominodatalab.com/workload-type=Batch. It will stop running on its own once it is finished and disappear from the list of active workloads.
- run-5e66ad066437fe0008ca1a8f-629p9 is a Workspace, with label dominodatalab.com/workload-type=Workspace. It will keep running until the user who launched it shuts it down. You have the option of contacting users to shut down their workspaces, waiting a day or two in the expectation that they will shut them down naturally, or removing the node with the workspaces still running. (The last option is not recommended unless you are certain there is no un-synced work in any of the workspaces and have communicated with the users about the interruption.)
- run-5e66b65e9c330f0008f70ab8-85f4f5f58c-m46j7 is an App, with label dominodatalab.com/workload-type=App. It is a long-running process governed by a Kubernetes deployment. It will be recreated automatically if you destroy the node hosting it, but will experience whatever downtime is required for a new pod to be created and scheduled on another node. See below for methods to proactively move the pod and reduce downtime.
- model-5e66ad4a9c330f0008f709e4-86bd9597b7-59fd9 is a Model API. It does not have a dominodatalab.com/workload-type label, and is instead easily identifiable by the pod name. It is also a long-running process, similar to an App, with similar concerns. See below for methods to proactively move the pod and reduce downtime.
- domino-build-5e67c9299c330f0008f70ad1 is a Compute Environment build. It will finish on its own and go into a Completed state.
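As a quicker alternative to describing each pod, one option is to add the workload-type label as a column in the pod listing with the -L flag of kubectl get (a sketch; output will vary with the workloads on your node):
$ kubectl get po -n domino-compute -o wide -L dominodatalab.com/workload-type | grep ip-192-168-24-46.us-east-2.compute.internal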
Manage long-running workloads
For the long-running workloads governed by a Kubernetes deployment, you can proactively move the pods off of the cordoned node by running a command like this:
$ kubectl rollout restart deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute
Notice the name of the deployment is the same as the first part of the name of the pod in the above section. You can see a list of all deployments in the compute namespace by running kubectl get deploy -n domino-compute.
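Once the rollout finishes, one way to confirm that the replacement pod landed on a different node is to repeat the wide pod listing used earlier, filtered to the deployment in question (names here are the ones from the example above):
$ kubectl get po -n domino-compute -o wide | grep model-5e66ad4a9c330f0008f709e4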
Whether the associated App or Model API experiences any downtime depends on the update strategy of the deployment. For the two long-running example workloads above, one App and one Model API, a test deployment shows the following (describe output filtered here for brevity):
$ kubectl describe deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute | grep -i "strategy\|replicas:"
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
RollingUpdateStrategy: 1 max unavailable, 1 max surge
$ kubectl describe deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute | grep -i "strategy\|replicas:"
Replicas: 2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType: RollingUpdate
RollingUpdateStrategy: 0 max unavailable, 25% max surge
The App in this case would experience some downtime, since the old pod will be terminated immediately (1 max unavailable with only 1 pod currently running). The Model API will not experience any downtime, since the termination of the old pod is forced to wait until a new pod is available (0 max unavailable). If desired, you can edit the deployments to change these settings and avoid downtime.
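For example, a minimal sketch of patching the App deployment above so that a rollout never takes its only pod down before a replacement is available (the values are illustrative; pick settings appropriate for your environment):
$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'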
Manage older versions of Kubernetes
Earlier versions of Kubernetes do not have the kubectl rollout restart command, but a similar effect can be achieved by "patching" the deployment with a throwaway annotation like this:
$ kubectl patch deploy run-5e66b65e9c330f0008f70ab8 -n domino-compute -p '{"spec":{"template":{"metadata":{"annotations":{"migration_date":"'$(date +%Y%m%d)'"}}}}}'
The patching process will respect the same update strategies as the above restart command.
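In either case, you can watch the progress of the rollout with kubectl rollout status, shown here for the model deployment from the earlier example:
$ kubectl rollout status deploy model-5e66ad4a9c330f0008f709e4 -n domino-compute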
In cases where you have to retire many nodes, it can be useful to loop over many nodes and/or workload pods in a single command. Customizing the output format of kubectl commands, filtering appropriately, and combining with xargs makes this possible.
For example, to cordon all nodes in the default node pool, you can run the following:
$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=:.metadata.name --no-headers | xargs kubectl cordon
To view only apps running on a particular node, you can filter using the labels discussed previously:
$ kubectl get pods -n domino-compute -o wide -l dominodatalab.com/workload-type=App | grep <node-name>
To do a rolling restart of all model pods (over all nodes), you can run:
$ kubectl get deploy -n domino-compute -o custom-columns=:.metadata.name --no-headers | grep model | xargs kubectl rollout restart -n domino-compute deploy
When constructing such commands for larger maintenance, always run the first part of the command by itself to verify that the list of names being passed to xargs and on to the final kubectl command is what you expect.
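For example, before running the bulk cordon command above, you might first run its first part on its own and review the node names it returns:
$ kubectl get nodes -l dominodatalab.com/node-pool=default -o custom-columns=:.metadata.name --no-headers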