Apache Spark is a fast and general-purpose cluster computing system that offers a unified analytics engine for large-scale data processing and machine learning.
Domino provides flexibility in how you use Spark. You can dynamically provision an on-demand Spark cluster orchestrated by Domino, or you can connect to an existing Spark cluster outside of Domino.
- On-demand Spark: Use Domino to dynamically provision and orchestrate a Spark cluster directly on the infrastructure that backs the Domino instance.
- Hadoop and Spark: Connect Domino projects to an existing Hadoop and Spark deployment outside of Domino through a suitably configured compute environment.
Spark clusters can use Spot instances to reduce infrastructure costs. We recommend using Spot instances only for worker (executor) nodes, because workers can recover from a failure. For the master node, always use on-demand instances.
If AWS interrupts a Spot instance, execution of on-demand or scheduled jobs on the Spark cluster may slow down. If this happens, the remediation is to change the job's hardware tier to use a non-Spot node pool until Spot instances of the requested type become available again.