Use this topic to prepare to deploy Domino on VMware Tanzu Kubernetes Grid.
The Domino Data Science Platform requires a CNCF-compliant Kubernetes cluster; Tanzu satisfies this requirement. See Kubernetes compatibility to choose a Kubernetes version that is supported with your version of Domino.
Domino must be configured to serve from a specific FQDN, and the DNS for that name must resolve to the address of an SSL-terminating load balancer with a valid certificate. The load balancer must forward incoming connections on ports 80 and 443 to port 80 on all nodes in the Platform pool, and it must support websocket connections. Health checks for this load balancer should use HTTP on port 80 and check for 200 responses from a path of /health on the nodes.
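For illustration only, the listener and health-check behavior the external load balancer must provide can be summarized as follows. This is a generic sketch, not a manifest for any particular load balancer product; adapt it to the configuration format of the load balancer you actually use:

# Generic illustration of the required load balancer behavior (not a real API)
listeners:
  - port: 80           # HTTP, forwarded to port 80 on every Platform node
    targetPort: 80
  - port: 443          # TLS terminated at the load balancer with a valid certificate
    targetPort: 80
websockets: enabled    # long-lived websocket connections must pass through
healthCheck:
  protocol: HTTP
  port: 80
  path: /health
  expectedStatus: 200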
Domino requires at least two storage classes: one for dynamic block storage and one for long-term shared storage. Domino can use existing storage classes or create them as part of the install.
Domino requires high-performance block storage for the following types of data:
- Ephemeral volumes attached to user executions.
- High-performance databases for Domino application object data.
This storage needs to be backed by a storage class with the following properties:
- Supports dynamic provisioning.
- Can be mounted on any node in the cluster.
- SSD-backed storage is recommended for fast I/O.
- Capable of provisioning volumes of at least 100 GB.
- The underlying storage provider can support ReadWriteOnce semantics.
- By default, this storage class is named domino-disk.

We recommend using the default tanzu-vm-storage-policy.
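As a sketch of what this could look like, the following storage class definition assumes the vSphere CSI driver is installed and references the tanzu-vm-storage-policy storage policy; the policy name and parameters are environment-specific, so adjust them as needed:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: domino-disk
provisioner: csi.vsphere.vmware.com              # vSphere CSI driver (assumed to be installed)
parameters:
  storagepolicyname: "tanzu-vm-storage-policy"   # environment-specific storage policy
allowVolumeExpansion: true
reclaimPolicy: Delete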
Domino needs a separate storage class for long-term storage for:
- Project data uploaded or created by users.
- Domino Datasets.
- Docker images.
- Domino backups.
This storage needs to be backed by a storage class with the following properties:
- Dynamically provisions Kubernetes PersistentVolumes.
- Can be accessed in ReadWriteMany mode from all nodes in the cluster.
- Uses a VolumeBindingMode of Immediate.

By default, this storage class is named domino-shared. We recommend that this be backed by vSAN or NFS.
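As one hedged example, if an NFS export is available and the NFS CSI driver (nfs.csi.k8s.io) is installed in the cluster, a ReadWriteMany-capable class could be defined as follows; the server address and export path are placeholders:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: domino-shared
provisioner: nfs.csi.k8s.io        # NFS CSI driver (assumed to be installed)
parameters:
  server: nfs.example.internal     # placeholder NFS server
  share: /exports/domino           # placeholder export path
reclaimPolicy: Retain
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1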
Domino requires a minimum of two node pools: one to host the Domino Platform and one to host Compute workloads. Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.
- Platform pool - Nodes hosting Domino platform services should have the label "dominodatalab.com/node-pool: platform".
- Compute pool - Worker nodes for Domino user executions should have the label "dominodatalab.com/node-pool: default".
- GPU pool - GPU worker nodes should be segregated into a GPU node pool by using the labels "dominodatalab.com/node-pool: default-gpu" and "nvidia.com/gpu: true".
Node pool labels are added automatically from the TanzuKubernetesCluster spec you provide for defining the cluster. You can verify that node pool labels have been applied properly by running the following command after deploying the cluster:
$ kubectl get nodes --show-labels
A VM Class can be thought of as a template for VM instances to be created in a Tanzu Kubernetes cluster. VM Classes provide a declarative method of defining your compute resources.
In the Domino Kubernetes cluster you will need a control plane group for the Kubernetes API and control plane resources. This is the Tanzu Kubernetes cluster control plane.
There must also be a domino-platform node pool with at least four replicas to host the Domino platform and web services. These are the Domino MLOps platform application nodes.
The solution reference architecture includes Tanzu Kubernetes cluster control plane nodes with the best-effort-medium VM Class:
- 2 vCPU
- 8 GB RAM
The domino-platform application node pool uses a best-effort-2xlarge VM Class:
- 8 vCPU
- 64 GB RAM
Additionally, domino-platform nodes should be provided with a local disk of at least 100 GB (see the YAML below).
You can create custom VM Classes for these groups as well as for your compute resources. The sizes above should be considered a minimum configuration.
Compute VMs (worker nodes) in the cluster are where Domino user workloads can be run. These should have at least the following specifications:
- 8 vCPU
- 32 GB RAM
A 400 GB local disk should be added to compute nodes to provide for a variety of workspace environment sizes and user work preferences.
Compute instances with GPU resources are created using the same VM Classes. For more details on how VM Classes are created and used in Tanzu, see Virtual Machine Classes for Tanzu Kubernetes Clusters.
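As a minimal sketch, a custom VM Class is defined in the Supervisor with a VirtualMachineClass resource. The example below uses a hypothetical name and matches the minimum compute worker specification above; a best-effort class like this declares hardware without reservations, and you should adjust the sizing to your needs:

apiVersion: vmoperator.vmware.com/v1alpha1
kind: VirtualMachineClass
metadata:
  name: besteffort-8c-32g-domino   # hypothetical name
spec:
  hardware:
    cpus: 8       # minimum compute worker vCPU count
    memory: 32Gi  # minimum compute worker memory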
We specify the storage resources, including the storage class and the number and size of local disks for the domino-platform and domino-compute nodes, when we define the Tanzu Kubernetes cluster in the TanzuKubernetesCluster spec YAML file.
For most deployments, compute resources will be substantially larger than these minimum specifications. It is recommended that you contact your Domino account representative or sales engineer to help you get a more accurate estimate of your sizing needs. A sizing guide for the Domino Platform is also available online on the Domino documentation website.
For more information on managing GPU and compute resources in Domino, see Manage Compute Resources.
Here is an example TanzuKubernetesCluster spec YAML file from the reference architecture that can be applied in the Tanzu supervisor cluster to create a suitable, simple Tanzu Kubernetes cluster containing NVIDIA GPUs:
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkg-cluster-vgpu-domino
  namespace: domino-ns
spec:
  topology:
    controlPlane:
      replicas: 4
      storageClass: vsan-r1
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.2
      vmClass: best-effort-medium
    nodePools:
      - name: domino-platform
        labels:
          dominodatalab.com/node-pool: "platform"
        replicas: 4
        vmClass: best-effort-2xlarge
        storageClass: vsan-r1
        volumes:
          - name: var-lib
            mountPath: /var/lib
            capacity:
              storage: 128Gi
      - name: mig-20
        labels:
          dominodatalab.com/node-pool: "mig20-gpu"
        replicas: 1
        storageClass: vsan-r1
        tkr:
          reference:
            name: v1.20.8---vmware.1-tkg.2
        vmClass: 16vcpu-64gram-mig-20c-vmxnet3
        volumes:
          - name: var-lib
            mountPath: /var/lib
            capacity:
              storage: 400Gi
      - name: mig-40
        labels:
          dominodatalab.com/node-pool: "default-gpu"
        replicas: 2
        storageClass: vsan-r1
        tkr:
          reference:
            name: v1.20.8---vmware.1-tkg.2
        vmClass: 16vcpu-64gram-mig-40c-vmxnet3
        volumes:
          - name: var-lib
            mountPath: /var/lib
            capacity:
              storage: 400Gi
      - name: nongpuworkers
        labels:
          dominodatalab.com/node-pool: "default"
          domino/build-node: "true"
        replicas: 1
        storageClass: vsan-r1
        tkr:
          reference:
            name: v1.20.8---vmware.1-tkg.2
        vmClass: best-effort-2xlarge
        volumes:
          - name: var-lib
            mountPath: /var/lib
            capacity:
              storage: 400Gi
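Once saved to a file, this spec is applied with kubectl against the Tanzu supervisor cluster in the vSphere Namespace that hosts the cluster (domino-ns in this example), for example with kubectl apply -f <file> -n domino-ns; the Tanzu Kubernetes Grid Service then provisions the control plane and node pools defined above.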