Domino can dynamically provision and orchestrate a Spark cluster directly on the infrastructure backing the Domino instance. This gives Domino users quick access to Spark without having to rely on their IT team to create and manage a cluster for them.
Domino supports fully containerized execution of Spark workloads on the Domino Kubernetes cluster. Users can work with Spark interactively through a Domino workspace, in batch mode through a Domino job, or directly with spark-submit.
When you start a workspace or a job that uses an on-demand cluster, Domino orchestrates a Spark cluster in standalone mode. The master and workers are newly deployed containers, and the driver is your Domino workspace or job.
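For example, from inside a workspace or job you would typically attach to the cluster by creating a SparkSession. The sketch below assumes the standalone master URL is exposed to your execution through an environment variable; the variable name SPARK_MASTER_URL is an assumption for illustration, not a documented Domino setting.

```python
# Minimal sketch: attach to an on-demand standalone cluster from a workspace or job.
# SPARK_MASTER_URL is an assumed variable name; check your environment for the
# actual source of the master URL. Falls back to local mode if it is unset.
import os
from pyspark.sql import SparkSession

master_url = os.environ.get("SPARK_MASTER_URL", "local[*]")

spark = (
    SparkSession.builder
    .master(master_url)
    .appName("domino-on-demand-example")
    .getOrCreate()
)

print(spark.version)
spark.stop()
```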
The Domino on-demand Spark cluster is suitable for the following workloads:
-
Distributed machine learning
Easily parallelize compute-heavy workloads such as distributed training or hyperparameter tuning. Spark bundles powerful machine learning algorithms in MLlib for this purpose (see the first sketch after this list).
-
Interactive exploratory analysis
Efficiently load a large data set in a distributed manner to explore and understand it using familiar query techniques with Spark SQL (see the second sketch after this list).
-
Featurization and data transformation (for experienced Spark users)
Sample, aggregate, relabel, or otherwise manipulate large data sets to make them more suitable for analysis or training (see the third sketch after this list).
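The first sketch shows distributed hyperparameter tuning with MLlib. The input path, column names, and parameter grid are illustrative placeholders, and a SparkSession already attached to the on-demand cluster is assumed.

```python
# Sketch of distributed hyperparameter tuning with MLlib. The cluster's workers
# evaluate candidate models in parallel; paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("mllib-tuning").getOrCreate()

# Expects a DataFrame with a "features" vector column and a "label" column.
train = spark.read.parquet("path/to/training_data")  # illustrative path

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (
    ParamGridBuilder()
    .addGrid(lr.regParam, [0.01, 0.1, 1.0])
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
    .build()
)

# CrossValidator fans the parameter grid out across the cluster.
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(labelCol="label"),
    numFolds=3,
    parallelism=4,  # number of models evaluated concurrently
)
model = cv.fit(train)
```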
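The second sketch shows interactive exploration with Spark SQL. The s3a location and column names are assumptions for illustration; reading from object storage also requires the appropriate Hadoop connector and credentials to be configured in your environment.

```python
# Sketch of interactive exploration with Spark SQL. The path and columns are
# placeholders for your own data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exploration").getOrCreate()

df = spark.read.parquet("s3a://your-bucket/events/")  # illustrative location
df.printSchema()

# Register the DataFrame as a temporary view and query it with familiar SQL.
df.createOrReplaceTempView("events")
spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").show()
```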
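The third sketch shows a featurization pass that samples, relabels, and aggregates a large DataFrame before writing it back out. All paths and column names are illustrative.

```python
# Sketch of a featurization pass: downsample, relabel, and aggregate, then
# persist the result for later training. Names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("featurize").getOrCreate()

raw = spark.read.parquet("s3a://your-bucket/raw/")  # illustrative location

features = (
    raw.sample(fraction=0.1, seed=42)  # downsample to 10% of rows
       .withColumn("label", F.when(F.col("score") > 0.5, 1).otherwise(0))  # relabel
       .groupBy("user_id")
       .agg(
           F.avg("value").alias("avg_value"),
           F.count("*").alias("n_events"),
       )
)

features.write.mode("overwrite").parquet("s3a://your-bucket/features/")
```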
Note
The following usage patterns are presently not suitable for on-demand Spark on Domino:
-
Stream processing pipeline
While Spark itself offers a robust stream processing engine, the ephemeral nature of on-demand clusters on Domino makes them a poor fit for long-lived stream processing applications.
For such cases, consider using an externally managed Spark cluster.
-
Collocated Spark and HDFS
Domino on-demand clusters do not come with an HDFS installation and are generally not suitable for collocating data and compute.
Data in Domino clusters is intended to reside outside the cluster, for example in an object store or a Domino data set (see the first sketch after this list). For cases where you want to use the cluster as long-term HDFS storage, consider using an externally managed Spark cluster.
-
Data pipelines with strict performance SLA
While Domino orchestrates Spark on Kubernetes reliably, no extensive performance tuning or optimization has been performed, and the cluster configuration and default Spark context configuration parameters may not be optimized for such workloads (see the second sketch after this list).
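To illustrate the recommended data pattern, here is a minimal sketch that reads from locations outside the cluster instead of from HDFS. The s3a bucket and the Domino data set mount path are both assumptions for illustration; object store access also requires the appropriate Hadoop connector and credentials.

```python
# Sketch of keeping data outside the cluster: read from an object store or a
# locally mounted Domino data set rather than cluster-local HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("external-data").getOrCreate()

# Read from an object store (illustrative bucket and path)...
df = spark.read.parquet("s3a://your-bucket/tables/orders/")

# ...or from a Domino data set mounted on the local filesystem
# (the mount path below is an assumption; adjust for your environment).
local_df = spark.read.csv(
    "file:///domino/datasets/local/your-dataset/orders.csv",
    header=True,
)
```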
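If a workload does need tuning, the default configuration can be overridden when the SparkSession is created. The specific values below are illustrative, not recommendations:

```python
# Sketch of overriding default Spark configuration at session creation time.
# The values are placeholders; tune them against your own workload.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-job")
    .config("spark.sql.shuffle.partitions", "400")  # match partitioning to data volume
    .config("spark.executor.memory", "8g")          # size executors for the workload
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```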