NVIDIA DGX systems can run Domino workloads if they are added to your Kubernetes cluster as compute (worker) nodes. This topic covers how to set up DGX systems and add them to Domino.
The flow chart begins at the top left, with a Domino end user requesting a GPU tier.
If a DGX is already configured for use in Domino's compute grid, the Domino platform administrator can define a GPU-enabled Hardware Tier from within the Admin console.
The middle swim lane of the flow chart outlines the steps required to integrate a provisioned DGX system as a node in the Kubernetes cluster that hosts Domino, and then to configure that node as a GPU-enabled component of Domino's compute grid.
The bottom swim lane shows that, to use an NVIDIA DGX system with Domino, the system must first be purchased and provisioned into the infrastructure stack that hosts Domino.
If this is a new (greenfield) deployment of Domino, you must first install and configure a Kubernetes cluster that meets Domino's Cluster Requirements, including valid configuration of your Kubernetes network policies to support secure communication between the pods that will host Domino's platform services and compute grid.
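As a quick sanity check before installing Domino, you can confirm that the cluster is reachable and that it accepts NetworkPolicy objects. This is an illustrative sketch; the namespaces and policies in your deployment will come from your Domino installation.

```shell
# Confirm the cluster is reachable and every intended worker node is Ready.
kubectl get nodes -o wide

# List any NetworkPolicy objects already defined. Note that a CNI plugin
# that does not enforce NetworkPolicy will accept this query but silently
# ignore the policies, so verify enforcement per your CNI plugin's docs.
kubectl get networkpolicy --all-namespaces
```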
Additionally, the proper taints must be added to your DGX node so that Kubernetes can select the DGX for GPU-based workloads running on Domino.
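For illustration, a node can be labeled and tainted with kubectl as shown below. The node name is a placeholder, and the exact label and taint keys depend on how your Domino deployment's node pools are configured, so treat this as a sketch rather than the required values.

```shell
# Label the DGX node so GPU Hardware Tiers can target it. The node-pool
# label key and value shown here follow a common Domino convention but may
# differ in your deployment.
kubectl label nodes <dgx-node-name> dominodatalab.com/node-pool=default-gpu

# Taint the node so that only workloads that tolerate the taint (GPU
# workloads) are scheduled onto the DGX.
kubectl taint nodes <dgx-node-name> nvidia.com/gpu=true:NoSchedule
```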
Configuration of the NVIDIA driver at the host level must be performed by your server administrator. The correct NVIDIA driver for your host can be identified by using the configuration guide found here. More information is available in the DGX Systems Documentation.
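Once the driver is installed, a quick way to confirm that the host detects its GPUs is nvidia-smi, which ships with the driver:

```shell
# Prints the installed driver version and one row per detected GPU.
nvidia-smi
```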
The CUDA software version required by a given development framework, such as TensorFlow, is documented on that framework's website. For example, TensorFlow >=2.1 requires CUDA 10.1 and some additional software packages, such as cuDNN.
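As a sanity check, TensorFlow itself can report the CUDA and cuDNN versions it was built against. The tf.sysconfig.get_build_info() API is available in TensorFlow 2.3 and later; for older releases such as the 2.1 example above, consult the compatibility matrix on the TensorFlow website.

```shell
# Run inside a workspace or container where TensorFlow is installed.
# Prints a dictionary that includes cuda_version and cudnn_version.
python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"
```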
CUDA & NVIDIA Driver Compatibility
After you identify the correct CUDA version for your specific needs, consult the CUDA-NVIDIA driver compatibility table.
In the TensorFlow 2.1 example, the CUDA 10.1 requirement means you must be running CUDA >=10.1 and NVIDIA driver >=410.48 on the host. Table 1 in the previous link will guide your choice of matching CUDA and NVIDIA driver versions.
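To compare the host against that table, you can read both versions directly on the DGX. The toolkit paths below vary by CUDA release (version.txt for CUDA 10.x, version.json for 11.x and later), so treat them as illustrative:

```shell
# CUDA toolkit version installed on the host, if the toolkit is present.
nvcc --version        # or: cat /usr/local/cuda/version.txt

# Driver version; for the TensorFlow 2.1 / CUDA 10.1 example this must
# report >= 410.48.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```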
Subsequently, the Domino Compute Environment must be configured to use the exact CUDA version that corresponds to the desired application.
The NVIDIA driver simplifies this constraint by providing backward compatibility: the CUDA version on the host can be greater than or equal to the version specified in your Compute Environment.
Because the CUDA software installation process often returns unexpected results when you attempt to install an exact CUDA version, including the patch version, the fastest route to a working configuration is typically to install the latest available minor release of your required major CUDA version, and then to create a Docker environment variable (ENV) in your Compute Environment that constrains compatible sets of CUDA versions, GPU generations, and NVIDIA drivers.
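For example, NVIDIA's container runtime reads the NVIDIA_REQUIRE_CUDA environment variable and refuses to start a container on a host whose driver cannot satisfy the constraint. A minimal sketch for the TensorFlow 2.1 / CUDA 10.1 scenario, added to the Dockerfile instructions of your Compute Environment (adjust the constraint string to your chosen versions):

```dockerfile
# Fail fast at container start if the host driver cannot provide CUDA 10.1.
# NVIDIA's official CUDA base images set a more detailed constraint that
# also pins driver ranges per GPU brand; "cuda>=10.1" is the minimal form.
ENV NVIDIA_REQUIRE_CUDA="cuda>=10.1"
```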
Need Additional Assistance?
Consult your Domino customer success engineer for guidance on your specific needs. Domino can provide sample configurations that will simplify your configuration process.
Build Node
Domino recommends you do not use a DGX GPU as a build node for environments. Instead, opt for a CPU resource as part of your overall Domino architecture.
Splitting GPUs per Tier
Domino recommends providing several GPU tiers with a different number of GPUs in each tier (for example, 1-, 2-, 4-, and 8-GPU Hardware Tiers), because different training jobs can make use of a single GPU or multiple GPUs in parallel, and consuming a whole DGX box for one workload might not be feasible in your environment.
After splitting up Hardware Tiers, access can be global or, alternatively, limited to specific organizations. Domino recommends ensuring that the right organizations have GPU Hardware Tier access, and that others are restricted, both to ensure availability for critical work and to prevent unauthorized use of GPU tiers.