This topic describes how to upgrade Kubernetes in your Amazon Elastic Kubernetes Service (EKS) Domino deployment. EKS is hosted on Amazon Web Services (AWS).
Important
Immediately after the Kubernetes upgrade, you must upgrade Domino to a version that is compatible with the new Kubernetes version. Domino will not work as expected until this is complete. For example, after upgrading Kubernetes to v1.22, you must upgrade to Domino v5.2 or later, because Kubernetes v1.22 is not compatible with older versions of Domino. Similarly, after upgrading to Kubernetes v1.23 or v1.24, you must upgrade to Domino v5.3 or later.
To upgrade Kubernetes on AWS, you must have the following:
- The CDK project used to deploy Domino.
- The config.yaml file previously used to deploy the CDK infrastructure.
- quay.io credentials provided by Domino.
- The SSH private key associated with your bastion host’s Elastic Compute Cloud (EC2) key pair.
- A Unix or Linux terminal with the following installed:
  - Node.js
  - Python3
  - Amazon Web Services Command Line Interface (AWS CLI)
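To quickly confirm that these tools are available on your workstation, you can optionally run the following checks (illustrative only, not part of the original procedure):
node --version
python3 --version
aws --version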
- Configure your workstation with your credentials:
aws configure
- Go to the cdk directory inside the repository and activate the virtual environment:
cd <cdk-cf-eks path>/cdk
source .venv/bin/activate
- Make sure that both the CDK and the CDK Python libraries are up to date relative to the repository:
npm list | egrep cdk
pip3 list | egrep aws-cdk
- If you must update the CDK or the CDK Python libraries, follow steps 1-3 of Provision Infrastructure and Runtime Environment.
- Open config.yaml and set `eks.version` to <version-number>.
- Deploy the CDK:
cdk deploy
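If you want to review the infrastructure changes before applying them, you can optionally preview them first (not required by this procedure):
cdk diff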
To upgrade managed node groups, see the official AWS guide.
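For reference, upgrading a managed node group is typically done with a command like the following; the cluster, node group, and version values are placeholders, and the AWS guide remains the authoritative procedure:
aws eks update-nodegroup-version --cluster-name <cluster-name> --nodegroup-name <nodegroup-name> --kubernetes-version <version-number> --region <region>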
To upgrade unmanaged nodes, you must first connect to the bastion host.
- To simplify the commands you’re about to run, fill in and export these variables:
export DEPLOY_NAME=<The name of your deployment>
export AWS_REGION=<The region where you intend to deploy resources>
- Get the bastion’s public IP address:
aws cloudformation describe-stacks --stack-name $DEPLOY_NAME --region $AWS_REGION --query "Stacks[0].Outputs[?OutputKey=='bastionpublicip'].OutputValue" --output text
- Connect to the bastion host:
ssh -i <your ssh key path> ec2-user@<bastion public ip>
- After you’re connected to the bastion host, fill in and export these variables:
export DEPLOY_NAME=<The name of your deployment>
export AWS_REGION=<The region where you intend to deploy resources>
export AWS_ACCESS_KEY_ID=<Your AWS access key ID>
export AWS_SECRET_ACCESS_KEY=<Your AWS secret key>
- Download kubectl:
curl -LO https://dl.k8s.io/release/<version-number>/bin/linux/amd64/kubectl
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
- Output the command to update your deployment’s kubeconfig:
aws cloudformation describe-stacks --stack-name $DEPLOY_NAME --region $AWS_REGION --query "Stacks[0].Outputs[?OutputKey=='ekskubeconfigcmd'].OutputValue" --output text
- Run the output from the previous command.
Note: The output from the previous command is unique to every deployment.
- Keep the session with the bastion host open.
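Optionally, verify that the kubectl you installed is the expected version and that it can reach the upgraded cluster (an illustrative check, not part of the original procedure):
kubectl version --client
kubectl get nodes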
Replace outdated nodes
To replace outdated nodes, use one of the following methods:
- Method A: Remove all the nodes, then set up replacement services. This requires less effort on your part, but causes longer downtime.
- Method B: Set up replacement services, then remove the nodes. This minimizes downtime but requires more effort.
Method A
This method causes up to ten minutes of downtime. It cycles through and terminates every node in every auto-scaling group, then waits for the creation of the replacement nodes. The deployment won’t be available until all the nodes are replaced and all the pods have Ready status.
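To check whether the pods have reached Ready status, you can periodically run commands like the following from the bastion session (illustrative; the namespaces match those used later in this guide):
kubectl get pods -n domino-platform
kubectl get pods -n domino-compute
kubectl get pods -n domino-system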
- Perform a test run and audit the results:
aws autoscaling describe-auto-scaling-groups --filters Name=tag:eks:cluster-name,Values=$DEPLOY_NAME --query 'AutoScalingGroups[*].AutoScalingGroupName' --output text | xargs -d ' ' -n 1 echo aws autoscaling start-instance-refresh --auto-scaling-group-name
The test run should output something like the following:
aws autoscaling start-instance-refresh --auto-scaling-group-name example-compute-0-us-west-2a
aws autoscaling start-instance-refresh --auto-scaling-group-name example-compute-0-us-west-2b
aws autoscaling start-instance-refresh --auto-scaling-group-name example-compute-0-us-west-2c
aws autoscaling start-instance-refresh --auto-scaling-group-name example-gpu-0-us-west-2a
aws autoscaling start-instance-refresh --auto-scaling-group-name example-gpu-0-us-west-2b
aws autoscaling start-instance-refresh --auto-scaling-group-name example-gpu-0-us-west-2c
aws autoscaling start-instance-refresh --auto-scaling-group-name example-platform-0-us-west-2a
aws autoscaling start-instance-refresh --auto-scaling-group-name example-platform-0-us-west-2b
aws autoscaling start-instance-refresh --auto-scaling-group-name example-platform-0-us-west-2c
- Audit the output from the test run. Make sure that it resembles the example and that the generated commands are what you expect.
- Run the same pipeline from the test run, but this time delete echo from the xargs invocation (`xargs -d ' ' -n 1 echo aws autoscaling` becomes `xargs -d ' ' -n 1 aws autoscaling`). This cycles through and replaces every node of every auto-scaling group, and the pods restart on the new nodes:
aws autoscaling describe-auto-scaling-groups --filters Name=tag:eks:cluster-name,Values=$DEPLOY_NAME --query 'AutoScalingGroups[*].AutoScalingGroupName' --output text | xargs -d ' ' -n 1 aws autoscaling start-instance-refresh --auto-scaling-group-name
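To monitor progress, you can optionally check the status of each instance refresh and watch for the replacement nodes to register with the cluster (illustrative commands; the group name is a placeholder taken from the test-run output):
aws autoscaling describe-instance-refreshes --auto-scaling-group-name <auto-scaling-group-name> --region $AWS_REGION
kubectl get nodes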
Method B
This method minimizes downtime, although it requires more effort. It adds an extra node to every auto-scaling group, individually drains each node, and ensures that replacement services are up before the outdated nodes are removed.
- From the AWS console, go to EC2 > Auto Scaling Groups.
- For each auto-scaling group with instances prefixed with your deployment name, go to Auto Scaling Groups > Instance management.
- Note each instance’s launch template version. If it differs from the version on the auto-scaling group’s launch template, do the following:
  - Go to the Details tab.
  - Click Edit.
  - Increase the Desired capacity by one.
- Find the Kubernetes node name for each instance: on the Instance management tab, click the instance ID. The Private IP DNS name is the name of the node in Kubernetes.
- To drain the nodes, go to the SSH session and run:
kubectl drain --disable-eviction --delete-emptydir-data --ignore-daemonsets <Private IP DNS name>
- Run the following commands periodically to check the status of the deployments in the Domino namespaces:
kubectl get deployments -n domino-compute
kubectl get deployments -n domino-platform
kubectl get deployments -n domino-system
Tip: Keep the Instance summary tab open to avoid losing track of which instances to terminate.
- Detach the instance:
  - Go to the Instance management tab and select the checkbox next to the instance you want to detach.
  - Go to Actions > Detach.
  - Select the Add a new instance to the Auto Scaling group to balance the load checkbox.
  - Click Detach Instance to confirm.
- From the Instance summary console, terminate the instance.
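After each instance is detached and terminated, you can optionally confirm that the old node no longer appears in the cluster and that the Domino deployments remain healthy (illustrative check, not part of the original procedure):
kubectl get nodes
kubectl get deployments -n domino-compute
kubectl get deployments -n domino-platform
kubectl get deployments -n domino-system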