Use Domino to create on-demand Spark, Dask, Ray, or MPI compute clusters to speed up computationally intensive jobs. Execute your jobs in any cloud or on-prem cluster to preserve data locality and optimize spend.
This article contains an overview and examples for compute clusters in Domino. Learn how to do the following:
- Enable clusters in your Domino deployment.
- Use Domino to orchestrate distributed and parallel training workloads.
Before you use on-demand clusters, enable them in your workspace and create a base cluster image:
- Configure Spark clusters for your Domino deployment.
- Configure Ray clusters for your Domino deployment.
- Configure Dask clusters for your Domino deployment.
- Configure MPI clusters for your Domino deployment.
Generally, there are two ways you can use compute clusters to train models in Domino:
- As the compute environment for an interactive workspace, such as a Jupyter notebook (or any other IDE), running on top of the cluster.
- As a job-based compute cluster that executes a training script or job you define.
Typically, interactive workspaces are used to explore datasets and training approaches. In contrast, use the job-based method after you’ve developed a training approach and want to repeat it.
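For example, a job-based run typically executes a short script that connects to the attached cluster and fans work out to it. Here is a minimal Ray sketch; the environment variable names for the head node are assumptions and may differ in your deployment:

```python
import os
import ray

# Assumption: the Ray head node address is exposed through environment
# variables; the names below are illustrative, not guaranteed.
ray.init(
    f"ray://{os.environ['RAY_HEAD_SERVICE_HOST']}:"
    f"{os.environ['RAY_HEAD_SERVICE_PORT']}"
)

@ray.remote
def square(x):
    # Runs on a cluster worker, not in the workspace or job container.
    return x * x

# Fan out 100 tasks across the cluster and collect the results.
print(sum(ray.get([square.remote(i) for i in range(100)])))
```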
Select the cluster type to learn more. For more information on choosing a cluster type, see our blog post Spark, Dask, and Ray: Choosing the right framework.
Spark provides a simple way to parallelize compute-heavy workloads such as distributed training. It is particularly well suited to iterative training algorithms and multi-threaded tasks over large data sets.
Domino supports fully containerized executions of Spark workloads on the Domino Kubernetes cluster. You can interact with Spark through Domino in the following ways:
- Use Spark in an interactive workspace.
- Use Spark in batch mode through a Domino job.
- Use Spark directly with spark-submit.
When you start a workspace or a job that uses an on-demand cluster, Domino orchestrates a cluster in standalone mode. The master and workers are newly deployed containers, and the driver is your Domino workspace or job.
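To make this concrete, here is a minimal PySpark sketch of the driver side. It assumes Domino has pre-configured the standalone master URL for the session; if that is not the case in your environment, you would pass the master explicitly via `.master(...)`:

```python
from pyspark.sql import SparkSession

# The workspace or job acts as the Spark driver. Assumption: Domino
# pre-configures the standalone master URL, so none is set here.
spark = SparkSession.builder.appName("distributed-training").getOrCreate()

# Sanity check: distribute a simple computation across the workers.
rdd = spark.sparkContext.parallelize(range(1_000_000))
print(rdd.map(lambda x: x * x).sum())

spark.stop()
```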
See the Spark quickstart project to walk through environment setup, project creation, and model training.
Domino also provides GPU-accelerated backend compute for the Spark workers. Combined with the RAPIDS Accelerator for Apache Spark, this enables GPU-accelerated processing on the Spark worker nodes. For more information, see the webinar on GPU-accelerated Spark and RAPIDS.
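As an illustrative sketch, enabling the plugin amounts to a few Spark settings. The plugin class and configuration keys come from the RAPIDS Accelerator documentation; the resource amount is a placeholder, and the RAPIDS jar must already be on the cluster's classpath:

```python
from pyspark.sql import SparkSession

# Sketch: enable the RAPIDS Accelerator on a GPU-backed cluster.
# Assumes the RAPIDS Accelerator jar is available to the workers.
spark = (
    SparkSession.builder
    .appName("gpu-accelerated")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")  # placeholder value
    .getOrCreate()
)
```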
Now that you understand how Spark, Dask, and Ray clusters work for jobs in Domino, see how to Tune Models with Ray Tune.