You can use Domino to dynamically provision and orchestrate a Spark cluster directly on the infrastructure that backs the Domino instance. This gives you quick access to Spark without having to rely on your IT team.
Domino supports fully containerized execution of Spark workloads on the Domino Kubernetes cluster. You can interact with Spark through a Domino workspace or in batch mode through a Domino job, as well as directly with spark-submit.
When you start a workspace or a job that uses an on-demand cluster, Domino orchestrates a Spark cluster in standalone mode. The master and workers run in newly deployed containers, and the driver is your Domino workspace or job.
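As a rough illustration of the driver-to-master relationship described above, the following Python sketch assembles a `spark-submit` command line that targets a standalone master in client mode, so the driver stays in the calling process. The helper name `build_spark_submit`, the host `spark-master.example`, and the script `job.py` are hypothetical placeholders, not values Domino provides. In a real workspace or job, the master address comes from the cluster that Domino provisions.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit(
    app: str,
    master_host: str,
    master_port: int = 7077,  # Spark's default standalone master port
    conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a spark-submit argument list for a standalone master.

    Client deploy mode keeps the driver in the process that calls
    spark-submit (assumed here to be the workspace or job).
    """
    args = [
        "spark-submit",
        "--master", f"spark://{master_host}:{master_port}",
        "--deploy-mode", "client",
    ]
    # Sort the settings so the generated command line is deterministic.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Hypothetical host and script names, for illustration only.
    cmd = build_spark_submit(
        "job.py",
        master_host="spark-master.example",
        conf={"spark.executor.memory": "4g"},
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess.run`.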
The Domino on-demand Spark cluster is suitable for the following workloads:
- Distributed machine learning
Easily parallelize compute-heavy workloads such as distributed training or hyperparameter tuning. Spark bundles machine learning algorithms in MLlib for this purpose.
- Interactive exploratory analysis
Efficiently load a large dataset in a distributed manner to explore and understand the data using familiar query techniques with Spark SQL.
- Featurization and data transformation (for experienced Spark users)
Sample, aggregate, relabel, or otherwise manipulate large datasets to make them more suitable for analysis or training.
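The exploration and featurization workloads listed above might look like the following minimal PySpark sketch. It assumes an existing `SparkSession` is passed in as `spark` and that the data is stored as Parquet. The function name `explore_and_prepare`, the view name `dataset`, and the column names used in the usage note are hypothetical.

```python
def explore_and_prepare(spark, path, group_col, label_col,
                        fraction=0.1, seed=42):
    """Summarize a dataset with Spark SQL, then sample and relabel it.

    `spark` is an existing SparkSession. Returns a pair of DataFrames:
    per-group row counts and the sampled, relabeled data.
    """
    df = spark.read.parquet(path)

    # Exploration: register the data as a view and aggregate it with SQL.
    df.createOrReplaceTempView("dataset")
    counts = spark.sql(
        f"SELECT {group_col}, COUNT(*) AS n FROM dataset "
        f"GROUP BY {group_col} ORDER BY n DESC"
    )

    # Featurization: take a reproducible sample and rename the target
    # column to the conventional name "label".
    prepared = (
        df.sample(fraction=fraction, seed=seed)
          .withColumnRenamed(label_col, "label")
    )
    return counts, prepared
```

With a session available, a call such as `explore_and_prepare(spark, "data/events.parquet", "country", "clicked")` returns both DataFrames lazily; no work runs on the cluster until an action such as `counts.show()` is invoked.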
The following are usage patterns that are presently not suitable for on-demand Spark on Domino:
- Stream processing pipeline
While Spark offers a robust stream processing engine, the ephemeral nature of on-demand clusters on Domino makes them a poor fit for long-lived stream processing applications.
For such cases, consider using an externally managed Spark cluster.
- Collocated Spark and HDFS
The Domino on-demand clusters do not come with an HDFS installation and are generally not suitable for collocating data and compute.
Data used by Domino clusters is intended to reside outside the cluster (for example, in an object store or a Domino dataset). If you need to use the cluster as long-term HDFS storage, consider using an externally managed Spark cluster.
- Data pipelines with strict performance SLAs
While Domino orchestrates Spark on Kubernetes reliably, no extensive performance tuning or optimization has been performed. The cluster configuration and default context configuration parameters might not be optimized for such workloads.