Provision Infrastructure

Use this topic to prepare to deploy Domino on VMware Tanzu Kubernetes Grid.

Kubernetes version

The Domino Data Science Platform requires a CNCF-compliant Kubernetes cluster; Tanzu satisfies this requirement. See Kubernetes compatibility to choose a Kubernetes version that is supported with your version of Domino.

Ingress and SSL

Domino must be configured to serve from a specific FQDN, and DNS for that name must resolve to the address of an SSL-terminating load balancer with a valid certificate. The load balancer must forward incoming connections on ports 80 and 443 to port 80 on all nodes in the Platform pool. The load balancer must also support WebSocket connections.

Health checks for this load balancer should use HTTP on port 80 and check for 200 responses from a path of /health on the nodes.
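
The same check the load balancer performs can be run by hand against a single node before the load balancer is configured. The following is a minimal sketch, not part of Domino: the helper name probe_health, the optional port argument, and the five-second timeout are assumptions made for illustration.

```shell
# probe_health HOST [PORT]
# Prints the HTTP status code returned by /health on the given host
# (port 80 unless overridden). Hypothetical helper; curl prints 000
# when no connection could be made.
probe_health() {
  curl --silent --output /dev/null --max-time 5 \
    --write-out '%{http_code}' "http://$1:${2:-80}/health"
}

# Example, with NODE_ADDRESS set to the address of a Platform pool node:
#   [ "$(probe_health "$NODE_ADDRESS")" = "200" ] && echo "healthy"
```

A node that is ready to receive traffic from the load balancer should print 200.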

Storage classes

Domino requires at least two storage classes, one for dynamic block storage and one for long-term shared storage. Domino can use existing storage classes or create them as part of the install.

Block storage

Domino requires high-performance block storage for the following types of data:

  • Ephemeral volumes attached to user execution

  • High-performance databases for Domino application object data

This storage needs to be backed by a storage class with the following properties:

  • Supports dynamic provisioning

  • Can be mounted on any node in the cluster

  • SSD-backed recommended for fast I/O

  • Capable of provisioning volumes of at least 100 GB

  • Underlying storage provider can support ReadWriteOnce semantics

By default, this storage class is named domino-disk.

We recommend using the default tanzu-vm-storage-policy.

Shared storage

Domino needs a separate storage class for long-term storage of:

  • Project data uploaded or created by users

  • Domino Datasets

  • Docker images

  • Domino backups

This storage needs to be backed by a storage class with the following properties:

  • Dynamically provisions Kubernetes PersistentVolumes

  • Can be accessed in ReadWriteMany mode from all nodes in the cluster

  • Uses a VolumeBindingMode of Immediate

By default, this storage class is named domino-shared.

We recommend this be backed by vSAN or NFS.
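
Once storage classes exist in the cluster, their binding modes can be read back with kubectl. The sketch below assumes the default class names domino-disk and domino-shared; the helper names are hypothetical. It checks only the binding mode, because ReadWriteMany support depends on the storage provider and is not recorded on the storage class itself.

```shell
# list_storage_classes: name, provisioner, and volume binding mode of each class.
list_storage_classes() {
  kubectl get storageclass \
    -o custom-columns='NAME:.metadata.name,PROVISIONER:.provisioner,BINDING:.volumeBindingMode'
}

# shared_class_is_immediate: reads the table above on stdin and succeeds only
# if a class named domino-shared exists with a binding mode of Immediate.
shared_class_is_immediate() {
  awk '$1 == "domino-shared" { found = 1; ok = ($3 == "Immediate") }
       END { exit !(found && ok) }'
}

# Example:
#   list_storage_classes | shared_class_is_immediate && echo "domino-shared OK"
```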

Node pools

Domino requires a minimum of two node pools, one to host the Domino Platform and one to host Compute workloads. Additional optional pools can be added to provide specialized execution hardware for some Compute workloads.

Platform pool - Nodes hosting Domino platform services should have the label dominodatalab.com/node-pool: platform

Compute pool - Worker nodes for Domino user executions should have the label dominodatalab.com/node-pool: default

GPU pool - GPU worker nodes should be segregated into a GPU node pool by using the labels dominodatalab.com/node-pool: default-gpu and nvidia.com/gpu: true

Node pool labels are added automatically from the TanzuKubernetesCluster spec you provide for defining the cluster. You can verify node pool labels have been applied properly by running the following command after deploying the cluster:

$ kubectl get nodes --show-labels
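
The label listing is long, so it can help to reduce it to a count of nodes per pool. This is a rough sketch that assumes the node pool label key is dominodatalab.com/node-pool; the helper name count_node_pools is hypothetical.

```shell
# count_node_pools: reads `kubectl get nodes --show-labels` output on stdin
# and prints how many nodes carry each node-pool label value.
# Assumes the label key dominodatalab.com/node-pool.
count_node_pools() {
  grep -o 'dominodatalab.com/node-pool=[^,[:space:]]*' |
    cut -d= -f2 | sort | uniq -c
}

# Example:
#   kubectl get nodes --show-labels | count_node_pools
```

Each output line is a node count followed by a pool name, which can be compared with the replica counts in the cluster spec.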

VM Class requirements

A VM Class can be thought of as a template for VM instances to be created in a Tanzu Kubernetes cluster. VM Classes provide a declarative method of defining your compute resources.

Platform resources

In the Domino Kubernetes cluster you will need a control plane group for the Kubernetes API and control plane resources. This is the Tanzu Kubernetes cluster control plane.

There must also be a domino-platform node pool with at least 4 replicas to host the Domino platform and web services. These are the Domino MLOps platform application nodes.

The reference architecture uses the best-effort-medium VM Class for the Tanzu Kubernetes cluster control plane nodes:

  • 2 vCPU

  • 8 GB RAM

The domino-platform application node pool uses the best-effort-2xlarge VM Class:

  • 8 vCPU

  • 64 GB RAM

Additionally, domino-platform nodes should be provided with a 100 GB local disk (see the YAML example below).

You can create custom VM Classes for these groups as well as your compute resources. The sizes above should be considered as a minimum configuration.

Compute resources

Compute VMs (worker nodes) run Domino user workloads. They should have at least the following specifications:

  • 8 vCPU

  • 32 GB RAM

A 400 GB local disk should be added to each compute node to accommodate a variety of workspace environment sizes and user work preferences.
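
After the cluster is deployed, the CPU and memory that each worker actually reports can be compared against these minimums. The sketch below is illustrative only: the helper names are hypothetical, and the label selector assumes worker nodes carry the dominodatalab.com/node-pool label, which keeps the smaller control plane nodes out of the listing. Only CPU is checked, because reported memory capacity is typically slightly below the VM's configured RAM.

```shell
# list_node_capacity: CPU and memory capacity reported by each labeled node.
# Assumes worker nodes carry the dominodatalab.com/node-pool label.
list_node_capacity() {
  kubectl get nodes --selector 'dominodatalab.com/node-pool' \
    -o custom-columns='NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory'
}

# undersized_nodes: reads that table on stdin and prints the names of nodes
# reporting fewer than 8 CPUs (the header row is skipped).
undersized_nodes() {
  awk 'NR > 1 && $2 + 0 < 8 { print $1 }'
}

# Example:
#   list_node_capacity | undersized_nodes
```

No output means every listed node reports at least 8 CPUs.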

GPU-enabled workers

Compute instances with GPU resources are created using the same virtual machine classes. For more details on how VM Classes are created and used in Tanzu, see Virtual Machine Classes for Tanzu Kubernetes Clusters.

Storage resources for the domino-platform and domino-compute nodes, including the storage class and the number and size of local disks, are specified in the TanzuKubernetesCluster spec YAML file that defines the Tanzu Kubernetes cluster.

For most deployments, compute resources will be substantially larger than these minimum specifications. Contact your Domino account representative or sales engineer for a more accurate estimate of your sizing needs. A sizing guide for the Domino Platform is also available on the Domino documentation website.

For more information on managing GPU and compute resources in Domino, see Manage Compute Resources.

Tanzu Kubernetes cluster with vGPU access

Here is an example TanzuKubernetesCluster spec YAML file from the reference architecture that can be applied in the Tanzu supervisor cluster to create a suitable simple Tanzu Kubernetes cluster containing NVIDIA GPUs:

apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkg-cluster-vgpu-domino
  namespace: domino-ns
spec:
  topology:
    controlPlane:
      replicas: 4
      storageClass: vsan-r1
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.2
      vmClass: best-effort-medium
    nodePools:
    # Domino platform services
    - name: domino-platform
      labels:
        dominodatalab.com/node-pool: "platform"
      replicas: 4
      vmClass: best-effort-2xlarge
      storageClass: vsan-r1
      volumes:
      - name: var-lib
        mountPath: /var/lib
        capacity:
          storage: 128Gi
    # GPU workers
    - name: mig-20
      labels:
        dominodatalab.com/node-pool: "mig20-gpu"
      replicas: 1
      storageClass: vsan-r1
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.2
      vmClass: 16vcpu-64gram-mig-20c-vmxnet3
      volumes:
      - name: var-lib
        mountPath: /var/lib
        capacity:
          storage: 400Gi
    - name: mig-40
      labels:
        dominodatalab.com/node-pool: "default-gpu"
      replicas: 2
      storageClass: vsan-r1
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.2
      vmClass: 16vcpu-64gram-mig-40c-vmxnet3
      volumes:
      - name: var-lib
        mountPath: /var/lib
        capacity:
          storage: 400Gi
    # Non-GPU compute workers
    - name: nongpuworkers
      labels:
        dominodatalab.com/node-pool: "default"
        domino/build-node: "true"
      replicas: 1
      storageClass: vsan-r1
      tkr:
        reference:
          name: v1.20.8---vmware.1-tkg.2
      vmClass: best-effort-2xlarge
      volumes:
      - name: var-lib
        mountPath: /var/lib
        capacity:
          storage: 400Gi
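
Applying the spec might look like the following sketch. It assumes kubectl is already logged in to the Supervisor Cluster context and that the spec has been saved to a local file; the file name in the example, the helper name create_cluster, and the KUBECTL override are hypothetical and used only for illustration.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the
# commands without contacting a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# create_cluster SPEC_FILE [NAMESPACE]
# Applies the spec, then lists the TanzuKubernetesCluster objects in the
# namespace (domino-ns unless overridden) so provisioning can be followed.
create_cluster() {
  "$KUBECTL" apply -f "$1" &&
    "$KUBECTL" get tanzukubernetescluster --namespace "${2:-domino-ns}"
}

# Example:
#   create_cluster tkg-cluster-vgpu-domino.yaml
```

Provisioning the node pools takes some time, so the get command may need to be repeated until the cluster reports that it is ready.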