Configure Spark prerequisites

Before you can start using on-demand Spark clusters on Domino, you must enable and configure the functionality on your deployment.

Create a base Spark cluster environment

By default, Domino provides Spark-compatible images, listed in the compute Environment catalog, that you can use for the components of the cluster.

When using on-demand Spark in Domino, you need one environment for the Spark cluster (base or worker environment) and one environment for the workspace/job execution (compute environment).

Create a new base Spark cluster environment

  1. Follow the instructions to create an environment.

  2. In the Base Image section, select Custom Image and specify an image URI that points to a deployable Spark image.

    Domino recommends that you use a Domino-provided Spark image from the compute Environment catalog that matches the versions of Spark and Python you need.

    Note

    Image compatibility:

    Domino currently republishes the Spark base images from bitnami/spark.

    Domino’s on-demand Spark functionality has been developed and tested using open-source Spark images from Bitnami.

  3. Required: In the Supported Clusters area, select the Domino managed Spark checkbox.

    This ensures that the environment is available for use when you create Spark clusters from workspaces and jobs.

  4. Set the Visibility.

    You can set this attribute the same way you would for any other compute environment.

  5. Leave the Dockerfile Instructions blank to use the Hadoop client libraries included with the image, or follow the instructions to configure custom Hadoop client libraries.

    Domino-provided images are installed with all prerequisites.

    See Manage dependencies to learn more.

  6. Leave Pluggable Notebooks / Workspace Sessions blank, because Spark base environments are not intended to include notebook configuration.

Base Spark cluster environment - default Hadoop client libraries

Leave the Dockerfile Instructions section blank if you want a thin base image that contains only core Spark with the default Hadoop client libraries.

Base Spark cluster environment (advanced) - custom Hadoop client libraries

The Hadoop client libraries that are pre-bundled with your Spark version might not be appropriate for your needs. This is common if you want to use cloud object store connector improvements introduced after Hadoop 2.7.

Add the following to the Dockerfile Instructions section, and adjust the Spark and Hadoop versions as needed:

### Needed if using the recommended Bitnami base image
USER root

### Make sure wget is available
RUN apt-get update && apt-get install -y wget && rm -r /var/lib/apt/lists /var/cache/apt/archives

### Modify the Hadoop and Spark versions below as needed
### NOTE: The HADOOP_HOME and SPARK_HOME locations should not be modified
ENV HADOOP_VERSION=3.1.1
ENV HADOOP_HOME=/opt/bitnami/hadoop
ENV HADOOP_CONF_DIR=/opt/bitnami/hadoop/etc/hadoop
ENV SPARK_VERSION=3.2.0
ENV SPARK_HOME=/opt/bitnami/spark
ENV PATH="$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin"

### Enable access to AWS and ADLS Gen2. Can modify as needed
ENV HADOOP_OPTIONAL_TOOLS="hadoop-aws,hadoop-azure,hadoop-azure-datalake"

### Remove the pre-installed Spark, which is pre-bundled with Hadoop, but preserve the Python env
WORKDIR /opt/bitnami
RUN if [ -d ${SPARK_HOME}/venv ]; then mv ${SPARK_HOME}/venv /opt/bitnami/temp-venv; fi
RUN rm -rf ${SPARK_HOME}

### Install the desired Hadoop-free Spark distribution
RUN wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} && \
    chmod -R 777 ${SPARK_HOME}/conf

### Restore the virtual Python environment
RUN if [ -d /opt/bitnami/temp-venv ]; then mv /opt/bitnami/temp-venv ${SPARK_HOME}/venv; fi

### Install the desired Hadoop libraries
RUN wget -q https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xf hadoop-${HADOOP_VERSION}.tar.gz && \
    rm hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

### Setup the Hadoop libraries classpath
RUN echo 'export SPARK_DIST_CLASSPATH="$(hadoop classpath):'"${HADOOP_HOME}"'/share/hadoop/tools/lib/*"' >> ${SPARK_HOME}/conf/spark-env.sh
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$HADOOP_HOME/lib/native"

### This is important to maintain compatibility with Bitnami
WORKDIR /
RUN /opt/bitnami/scripts/spark/postunpack.sh
WORKDIR ${SPARK_HOME}

USER 1001
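
Because the instructions above download archives whose names are derived from SPARK_VERSION and HADOOP_VERSION, a mistyped version only surfaces as a failed image build. The following is a minimal Python sketch, not part of Domino, that builds the same archive URLs so you can inspect them before you edit the environment. The function names (spark_url, hadoop_url, url_exists) are hypothetical, and the sketch assumes that both archives are reachable over https.

```python
import urllib.error
import urllib.request

# Assumption: both archives are served over https from archive.apache.org,
# following the URL patterns used in the Dockerfile instructions.
SPARK_ARCHIVE = "https://archive.apache.org/dist/spark"
HADOOP_ARCHIVE = "https://archive.apache.org/dist/hadoop/common"


def spark_url(spark_version: str) -> str:
    """Return the URL of the Hadoop-free Spark distribution for a version."""
    name = f"spark-{spark_version}"
    return f"{SPARK_ARCHIVE}/{name}/{name}-bin-without-hadoop.tgz"


def hadoop_url(hadoop_version: str) -> str:
    """Return the URL of the Hadoop distribution for a version."""
    name = f"hadoop-{hadoop_version}"
    return f"{HADOOP_ARCHIVE}/{name}/{name}.tar.gz"


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Optional network check: send a HEAD request and report success."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False


if __name__ == "__main__":
    # Versions taken from the ENV lines in the instructions above.
    print(spark_url("3.2.0"))
    print(hadoop_url("3.1.1"))
```

If the build host has outbound network access, calling url_exists on each URL gives a quick check that the chosen versions are published in the archive.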

Prepare your PySpark execution compute environment

You must configure the PySpark compute environments for the workspaces and jobs that will connect to your cluster.

Domino recommends that you use a pre-built base image from the compute Environment catalog to create a compatible workspace environment.

Customize this Workspace compute environment

Use the image mentioned previously, and add the following Pluggable Workspace Tools definitions.

jupyter:
  title: "Jupyter (Python, R, Julia)"
  iconUrl: "/assets/images/workspace-logos/Jupyter.svg"
  start: [ "/opt/domino/workspaces/jupyter/start" ]
  supportedFileExtensions: [ ".ipynb" ]
  httpProxy:
    port: 8888
    rewrite: false
    internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
    requireSubdomain: false
jupyterlab:
  title: "JupyterLab"
  iconUrl: "/assets/images/workspace-logos/jupyterlab.svg"
  start: [  "/opt/domino/workspaces/jupyterlab/start" ]
  httpProxy:
    internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
    port: 8888
    rewrite: false
    requireSubdomain: false
vscode:
  title: "vscode"
  iconUrl: "/assets/images/workspace-logos/vscode.svg"
  start: [ "/opt/domino/workspaces/vscode/start" ]
  httpProxy:
    port: 8888
    requireSubdomain: false
rstudio:
  title: "RStudio"
  iconUrl: "/assets/images/workspace-logos/Rstudio.svg"
  start: [ "/opt/domino/workspaces/rstudio/start" ]
  httpProxy:
    port: 8888
    requireSubdomain: false
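
A typo in a tool definition is easy to miss when you paste it into the environment. The following is a minimal Python sketch of a sanity check over definitions like the ones above, represented as plain dictionaries. The function name check_tool and the set of required keys are assumptions derived from the keys that all four definitions above share. They are not Domino's actual validation rules.

```python
from typing import Any, Dict, List

# Assumption: keys shared by every definition in the listing above,
# not an official Domino schema.
REQUIRED_KEYS = ("title", "iconUrl", "start", "httpProxy")


def check_tool(name: str, tool: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in one tool definition."""
    problems = [f"{name}: missing '{key}'" for key in REQUIRED_KEYS if key not in tool]

    # 'start' is a list of command parts in every definition above.
    start = tool.get("start")
    if start is not None and (not isinstance(start, list) or not start):
        problems.append(f"{name}: 'start' must be a non-empty list")

    # 'httpProxy' is a mapping that carries at least a numeric port.
    proxy = tool.get("httpProxy")
    if isinstance(proxy, dict):
        port = proxy.get("port")
        valid_port = isinstance(port, int) and not isinstance(port, bool) and 1 <= port <= 65535
        if not valid_port:
            problems.append(f"{name}: 'httpProxy.port' must be an integer from 1 to 65535")
    elif proxy is not None:
        problems.append(f"{name}: 'httpProxy' must be a mapping")

    return problems


if __name__ == "__main__":
    tools = {
        # Copied from the vscode definition above.
        "vscode": {
            "title": "vscode",
            "iconUrl": "/assets/images/workspace-logos/vscode.svg",
            "start": ["/opt/domino/workspaces/vscode/start"],
            "httpProxy": {"port": 8888, "requireSubdomain": False},
        },
        # Hypothetical broken entry, for illustration only.
        "broken": {"title": "Broken", "start": [], "httpProxy": {"port": "8888"}},
    }
    for tool_name, definition in tools.items():
        for problem in check_tool(tool_name, definition):
            print(problem)
```

If PyYAML is installed, yaml.safe_load applied to the definitions text yields a dictionary of this shape, so each entry can be passed to check_tool directly.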