In a shared Spark cluster, it can be challenging for teams to manage their dependencies (for example, Python packages or JARs). Installing every dependency that a Spark application may need before it runs and dealing with version conflicts can be complex and time-consuming.
Domino allows you to easily package and manage dependencies as part of your Spark-enabled compute environments. This approach creates the flexibility to manage dependencies for individual projects or workloads without having to deal with the complexity of a shared cluster.
To add a new dependency, add the appropriate statements to the Docker Instructions section of both the relevant Spark compute environment and the execution compute environment.
For example, to add numpy, include the following:
USER root
# Optionally pin a specific version (for example, numpy==<version>)
RUN pip install numpy
USER ubuntu
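
JAR dependencies can be handled the same way. The following is a minimal sketch that bakes a JAR into the image by downloading it into Spark's jars directory; the URL and the /opt/spark/jars path are illustrative assumptions, so adjust them to match the library you need and the Spark installation location in your image.

USER root
# Hypothetical example: download a JAR into Spark's jars directory.
# The URL and /opt/spark/jars are placeholders; substitute your own.
ADD https://repo.example.com/path/to/your-library.jar /opt/spark/jars/
USER ubuntu

Baking the JAR into the image this way keeps the environment self-contained, so executions do not need to fetch the dependency at run time.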