Using PySpark in Jupyter Workspaces

Overview

You can configure a Domino Workspace to launch a Jupyter notebook with a connection to your Spark cluster.

This allows you to operate the cluster interactively from Jupyter with PySpark.

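The instructions for configuring a PySpark Workspace are below. To use them, you must have a Domino environment that meets the following prerequisites: it uses a Domino standard base image, and it already has the binaries and configuration files needed to connect to your Spark cluster installed.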

Note

PySpark 2 does not support Python 3.8 or higher. Build PySpark 2 compute environments from images with a Python version earlier than 3.8, or use PySpark 3.
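The rule in this note can be expressed as a small check. The following is a minimal sketch, not part of Domino or PySpark: the helper names `parse_version` and `python_supported` are hypothetical, and the check encodes only the restriction stated above (it says nothing about other version combinations).

```python
def parse_version(text):
    """Turn a dotted version such as '3.7.9' into (3, 7, 9).

    Assumes purely numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.split("."))


def python_supported(pyspark_version, python_version):
    """Apply the rule from the note: PySpark 2 needs Python earlier than 3.8."""
    if parse_version(pyspark_version)[0] == 2:
        return parse_version(python_version)[:2] < (3, 8)
    # The note only restricts PySpark 2, so nothing else is checked here.
    return True


# In a running environment the inputs could come from the interpreter itself:
#   import sys, pyspark
#   python_supported(pyspark.__version__, "%d.%d" % sys.version_info[:2])
```

Running a check like this inside the base image before building the environment can catch an incompatible pairing early.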

Adding a PySpark Workspace option to your environment

  1. From the Domino main menu, click Environments.

  2. Click the name of an environment that meets the prerequisites listed above. It must use a Domino standard base image and already have the necessary binaries and configuration files installed for connecting to your Spark cluster.

  3. On the environment overview page, click Edit Definition.

  4. In the Pluggable Workspace Tools field, paste the following YAML configuration.

    pyspark:
       title: "PySpark"
       start: [ "/var/opt/workspaces/pyspark/start" ]
       iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png"
       httpProxy:
          port: 8888
          internalPath: "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
          rewrite: false
          requireSubdomains: false
       supportedFileExtensions: [ ".ipynb" ]
    

    When finished, the field should look like this.

    pyspark-pluggable-workspace-tools.png

  5. Click Build to apply the changes and build a new version of the environment. Upon a successful build, the environment is ready for use.
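The internalPath value in the configuration from step 4 is a template with placeholders and one conditional block. The sketch below illustrates how such a template could expand, assuming Handlebars-style semantics (a `{{#if name}}` block is kept only when `name` has a non-empty value). The `render` function and the sample values are hypothetical and serve only as an illustration; this is not Domino's own rendering code.

```python
import re

INTERNAL_PATH = (
    "/{{ownerUsername}}/{{projectName}}/{{sessionPathComponent}}/"
    "{{runId}}/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
)


def render(template, values):
    """Expand {{#if name}}...{{/if}} blocks, then plain {{name}} placeholders."""

    def if_block(match):
        # Keep the block body only when the named value is present and truthy.
        return match.group(2) if values.get(match.group(1)) else ""

    expanded = re.sub(r"\{\{#if (\w+)\}\}(.*?)\{\{/if\}\}", if_block, template)
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), expanded)


# Hypothetical values, chosen only to show the shape of the result.
session = {
    "ownerUsername": "alice",
    "projectName": "demo",
    "sessionPathComponent": "notebookSession",
    "runId": "abc123",
}

without_file = render(INTERNAL_PATH, session)
with_file = render(INTERNAL_PATH, dict(session, pathToOpen="analysis.ipynb"))
```

Under these assumptions, the `tree/` segment appears only when a file path is supplied, which is why the template wraps it in the conditional block.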

Note

If you are using an older version of a Domino Standard Environment, you may require a different Pluggable Workspace Tool definition for PySpark. The safest way to create one is to copy the Jupyter pluggable workspace definition for your base image (see the README for your base image in the Domino Analytics Distribution Git repository) and replace Jupyter with PySpark in the title and start fields. You can use the same iconUrl specified above to get the correct PySpark icon.
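The substitution described in this note can be sketched as a small transformation on the parsed definition. The Jupyter definition below is a hypothetical stand-in (copy the real one from the README for your base image), and `to_pyspark` is an illustrative helper, not a Domino utility.

```python
import copy

# Same icon URL as in the PySpark definition shown earlier on this page.
PYSPARK_ICON = (
    "https://raw.githubusercontent.com/dominodatalab/workspace-configs/"
    "develop/workspace-logos/PySpark.png"
)


def to_pyspark(jupyter_def):
    """Return a copy of a Jupyter tool definition adapted for PySpark."""
    pyspark_def = copy.deepcopy(jupyter_def)  # leave the input untouched
    # Replace Jupyter with PySpark in the title and start fields only.
    pyspark_def["title"] = jupyter_def["title"].replace("Jupyter", "PySpark")
    pyspark_def["start"] = [
        part.replace("jupyter", "pyspark") for part in jupyter_def["start"]
    ]
    pyspark_def["iconUrl"] = PYSPARK_ICON
    return pyspark_def


# Hypothetical Jupyter definition, for illustration only.
jupyter_def = {
    "title": "Jupyter",
    "start": ["/var/opt/workspaces/jupyter/start"],
    "httpProxy": {"port": 8888, "rewrite": False},
}
pyspark_def = to_pyspark(jupyter_def)
```

All other fields, such as httpProxy, carry over unchanged, which is the point of starting from the definition that matches your base image.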

Launching PySpark Workspaces

  1. Open the project in which you want to use a PySpark Workspace.

  2. Open the project settings, then follow the provider-specific instructions from the Hadoop and Spark overview to set up the project to work with an existing Spark connection environment. This involves enabling YARN integration in the project settings.

  3. On the Hardware & Environment tab of the project settings, choose the environment to which you added the PySpark configuration in the previous section.

  4. Once the above settings are applied, you can launch a PySpark Workspace from the Workspaces dashboard.

    pyspark-workspace-selection.png
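Once the workspace is running, a quick way to confirm that the notebook can reach the cluster is to run a trivial job. The following is a minimal sketch using the standard PySpark `SparkSession` API. The helper names `connect` and `cluster_summary` and the application name are hypothetical, and whether your workspace already provides a session object depends on its start script, so `getOrCreate()` is used to cover both cases.

```python
def connect(app_name="workspace-smoke-test"):
    """Return a SparkSession, reusing an existing one if there is one."""
    # Imported lazily so cluster_summary can be read or tested without Spark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def cluster_summary(spark, rows=1000):
    """Run a trivial job and report which Spark the notebook is talking to."""
    return {
        "spark_version": spark.version,
        "master": spark.sparkContext.master,
        # range(...).count() forces a small job to actually execute.
        "row_count": spark.range(rows).count(),
    }


# In a notebook cell of the PySpark Workspace:
#   print(cluster_summary(connect()))
```

If the job completes and `row_count` matches the requested number of rows, the session is working. With YARN integration enabled you would typically expect `master` to be reported as `yarn`, although the exact value depends on your cluster configuration.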