On-Demand Spark Overview

Overview

Domino offers the ability to dynamically provision and orchestrate a Spark cluster directly on the infrastructure backing the Domino instance. This allows Domino users to get quick access to Spark without having to rely on their IT team to create and manage one for them.

Orchestrating Spark on Domino

Domino supports fully containerized execution of Spark workloads on the Domino Kubernetes cluster. Users can work with Spark interactively through a Domino workspace, in batch mode through a Domino job, or directly with spark-submit.

When you start a workspace or a job that uses an on-demand cluster, Domino orchestrates a cluster in Standalone mode. The master and workers are newly deployed containers and the driver is your Domino workspace or job.
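As a rough illustration of this topology, the sketch below assembles a client-mode spark-submit command against a Standalone master, so the driver runs wherever the command is launched (here, the workspace or job). The host name and script name are hypothetical placeholders, not values taken from Domino. Port 7077 is the Spark Standalone default, and the port on an actual on-demand cluster may differ.

```python
import shlex

# Hypothetical values for illustration only; they are not Domino-provided names.
MASTER_HOST = "spark-master.example.internal"
MASTER_PORT = 7077  # default port of a Spark Standalone master


def build_spark_submit(app_path, master_host=MASTER_HOST,
                       master_port=MASTER_PORT, app_args=()):
    """Return the argv list for a client-mode spark-submit against a Standalone master."""
    master_url = f"spark://{master_host}:{master_port}"
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", "client",  # the driver runs where the command is launched
        app_path,
        *app_args,
    ]


# Build the command and print it as a single, shell-quoted line.
cmd = build_spark_submit("analysis.py", app_args=["--date", "2024-01-01"])
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps each argument intact if it is later passed to `subprocess.run`.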

Suitable use cases

The Domino on-demand Spark cluster is suitable for the following workloads:

  • Distributed machine learning

    Easily parallelize compute-heavy workloads such as distributed training or hyperparameter tuning. Spark bundles machine learning algorithms for this purpose in MLlib.

  • Interactive exploratory analysis

    Load a large data set in a distributed manner, then explore and understand the data using familiar query techniques with Spark SQL.

  • Featurization and data transformation (for experienced Spark users)

    Sample, aggregate, relabel, or otherwise manipulate large data sets to make them more suitable for analysis or training.

    Note

    Optimal performance requires a cluster with sufficient resources and a data science practitioner who is adept at tuning their Spark application and writing performant Spark transforms.
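The interactive exploration case above can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not Domino's prescribed workflow: the function names, view name, and application name are hypothetical, and the session is assumed to pick up the on-demand cluster's master from the environment's Spark configuration.

```python
def count_by_column_query(view, column):
    """Build a Spark SQL query that counts rows per distinct value of `column`."""
    return (
        f"SELECT `{column}`, COUNT(*) AS n "
        f"FROM `{view}` GROUP BY `{column}` ORDER BY n DESC"
    )


def explore(path, column, view="dataset"):
    """Load a Parquet data set and show the most frequent values of `column`."""
    # Imported here so the query helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    # Assumption: the Spark configuration in the workspace already points at
    # the on-demand cluster, so no master URL is set explicitly here.
    spark = SparkSession.builder.appName("exploration-sketch").getOrCreate()

    # Register the data as a temporary view so it can be queried with SQL.
    spark.read.parquet(path).createOrReplaceTempView(view)
    spark.sql(count_by_column_query(view, column)).show(10)
```

From a workspace you would call `explore(...)` with the path to your own Parquet data and a column name. The read is distributed across the workers, and only the ten rows displayed by `show` are returned to the driver.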

Unsuitable use cases

The following usage patterns are not currently suitable for on-demand Spark on Domino:

  • Stream processing pipeline

    While Spark itself offers a robust stream processing engine, the ephemeral nature of on-demand clusters on Domino makes them a poor fit for long-lived stream processing applications.

    For such cases, you should consider using an externally managed Spark cluster.

  • Collocated Spark and HDFS

    The Domino on-demand clusters do not come with an HDFS installation and are generally not suitable for collocating data and compute.

    Data in Domino clusters is intended to reside outside the cluster (e.g., in an object store or a Domino data set). For cases where it is desirable to use the cluster as long-term HDFS storage, you should consider using an externally managed Spark cluster.

  • Data pipelines with strict performance SLA

    While Domino orchestrates Spark on Kubernetes in a reliable way, no extensive performance tuning or optimization has been performed. The cluster configuration and default context configuration parameters may not be optimized for such workloads.

    Note

    If you intend to explore on-demand Domino Spark clusters for such workloads, you should perform extensive validation and tuning of your jobs.