You can use Domino to dynamically provision and orchestrate a Spark cluster directly on the infrastructure that backs the Domino instance. This gives you quick access to Spark without having to rely on your IT team.
Domino supports fully containerized execution of Spark workloads on the Domino Kubernetes cluster. You can interact with Spark through a Domino workspace or in batch mode through a Domino job, as well as directly with spark-submit.
When you start a workspace or a job that uses an on-demand cluster, Domino orchestrates a cluster in standalone mode. The master and workers are newly deployed containers and the driver is your Domino workspace or job.
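As a rough illustration of the driver's role, the following sketch shows how a workspace or job could build a session against a standalone master. It assumes PySpark is installed in the environment. The host name and resource values are hypothetical, and Domino may already preconfigure the connection for you, in which case a plain `SparkSession.builder.getOrCreate()` could be enough.

```python
from typing import Dict


def standalone_master_url(host: str, port: int = 7077) -> str:
    """Return a Spark standalone master URL (7077 is Spark's default master port)."""
    return f"spark://{host}:{port}"


def session_options(master_url: str, executor_memory: str = "4g",
                    executor_cores: int = 2) -> Dict[str, str]:
    """Collect standard Spark properties that the driver passes to the cluster."""
    return {
        "spark.master": master_url,
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
    }


def create_session(app_name: str, options: Dict[str, str]):
    """Start (or reuse) a SparkSession in the driver process."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical host name; replace it with the master address of your cluster.
options = session_options(standalone_master_url("spark-master.example.internal"))
```

In a workspace, calling `create_session("exploration", options)` would then return the session that runs in the driver and schedules work on the worker containers.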
The Domino on-demand Spark cluster is suitable for the following workloads:
- Distributed machine learning
Spark provides a simple way to parallelize compute-heavy workloads such as distributed training. Spark comes with powerful machine learning algorithms bundled in MLlib for this purpose.
- Interactive exploratory analysis
Efficiently load a large dataset in a distributed manner to explore and understand the data using familiar query techniques with Spark SQL.
- Featurization and data transformation (for experienced Spark users)
Sample, aggregate, relabel, or otherwise manipulate large datasets to make them more suitable for analysis or training.
Note: For optimal performance, you must have a cluster with sufficient resources and you must be adept at tuning your Spark application and writing performant Spark transforms.
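The first two workloads above can be sketched in PySpark as follows. This is a minimal example under stated assumptions, not a recommended setup: the view name `events`, the column names, and the Parquet input are hypothetical, and `spark` is an existing `SparkSession`.

```python
from typing import List


def summary_query(view: str, group_col: str, value_col: str) -> str:
    """Build a Spark SQL aggregation over a temporary view."""
    return (
        f"SELECT {group_col}, COUNT(*) AS n, AVG({value_col}) AS mean_value "
        f"FROM {view} GROUP BY {group_col} ORDER BY n DESC"
    )


def explore(spark, path: str, group_col: str, value_col: str):
    """Load a Parquet dataset in a distributed manner and summarize it with Spark SQL."""
    df = spark.read.parquet(path)
    df.createOrReplaceTempView("events")  # hypothetical view name
    return spark.sql(summary_query("events", group_col, value_col))


def train(df, feature_cols: List[str], label_col: str):
    """Fit an MLlib logistic regression on numeric feature columns."""
    # Imported lazily; requires pyspark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # MLlib estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, lr]).fit(df)
```

For example, `explore(spark, path, "country", "amount").show()` would print per-country counts and averages, and `train(df, ["f1", "f2"], "label")` would return a fitted pipeline model.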
The following are usage patterns that are presently not suitable for on-demand Spark on Domino:
- Stream processing pipelines
While Spark offers a robust stream processing engine, the ephemeral nature of on-demand clusters on Domino makes them a poor fit for long-lived stream processing applications.
For such cases, consider using an externally managed Spark cluster.
- Collocated Spark and HDFS
The Domino on-demand clusters do not come with an HDFS installation and are generally not suitable for collocating data and compute.
Data used by Domino clusters is intended to reside outside the cluster (for example, in an object store or a Domino dataset). If you want to use the cluster as long-term HDFS storage, consider using an externally managed Spark cluster.
- Data pipelines with strict performance SLAs
While Domino orchestrates Spark on Kubernetes reliably, no extensive performance tuning or optimization has been performed. The cluster configuration and default context configuration parameters might not be optimized for such workloads.
Note: If you intend to explore on-demand Domino Spark clusters for such workloads, perform extensive validation and tuning of your jobs.
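Because on-demand clusters are ephemeral and data is meant to reside outside them, a typical batch job reads its input from external storage and writes its results back before the cluster is torn down. The following sketch illustrates that pattern. The bucket name and paths are hypothetical, and it assumes the `s3a` connector and credentials are available in the environment.

```python
def object_uri(bucket: str, key: str, scheme: str = "s3a") -> str:
    """Build an object store URI such as s3a://bucket/path."""
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


def run_batch(spark, source_uri: str, target_uri: str, group_col: str) -> None:
    """Read from external storage, aggregate, and persist the result externally."""
    df = spark.read.parquet(source_uri)
    counts = df.groupBy(group_col).count()
    # Nothing is kept on the cluster; the output outlives the workers.
    counts.write.mode("overwrite").parquet(target_uri)


# Hypothetical bucket and prefixes, for illustration only.
source = object_uri("example-bucket", "raw/events")
target = object_uri("example-bucket", "curated/event_counts")
```

With an existing session, `run_batch(spark, source, target, "country")` would run the whole read, aggregate, and write cycle.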