Using PySpark in Jupyter Workspaces

Overview

You can configure a Domino Workspace to launch a Jupyter notebook with a connection to your Spark cluster. This allows you to operate the cluster interactively from Jupyter with PySpark.

Follow the instructions below to configure a PySpark Workspace. To use them, you must have a Domino environment that uses a Domino standard base image and already has the binaries and configuration files required to connect to your Spark cluster.

Adding a PySpark Workspace option to your environment

  1. From the Domino main menu, click Environments.

  2. Click the name of an environment that meets the prerequisites listed above. It must use a Domino standard base image and already have the necessary binaries and configuration files installed for connecting to your Spark cluster.

  3. On the environment overview page, click Edit Definition.

  4. In the Pluggable Workspace Tools field, paste the following YAML configuration.

    pyspark:
      title: "PySpark"
      start: [ "/var/opt/workspaces/pyspark/start" ]
      iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png"
      httpProxy:
        port: 8888
        rewrite: false
        internalPath: "/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
      supportedFileExtensions: [ ".ipynb" ]
    

    When finished, the field should look like this.

    [Screenshot: the Pluggable Workspace Tools field with the PySpark configuration pasted in]

  5. Click Build to apply the changes and build a new version of the environment. Upon a successful build, the environment is ready for use.
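
The `internalPath` value in the configuration above uses a Handlebars-style conditional. The following Python sketch illustrates how such a template would resolve, assuming standard `{{#if}}` semantics (an empty or missing `pathToOpen` is treated as false) and ignoring any escaping the template engine might apply. The function name `resolve_internal_path` is hypothetical and is not part of Domino.

```python
def resolve_internal_path(path_to_open=None):
    """Mimic "/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}".

    Assumption: the template follows Handlebars-style {{#if}} rules,
    so None or an empty string yields only the leading slash.
    """
    if path_to_open:
        # A file was requested: open it under Jupyter's /tree/ route.
        return "/tree/" + path_to_open
    # No file requested: open the Jupyter root.
    return "/"


print(resolve_internal_path())                      # /
print(resolve_internal_path("notebooks/demo.ipynb"))  # /tree/notebooks/demo.ipynb
```

Under these assumptions, launching the Workspace without a file opens the notebook server root, and launching it on an `.ipynb` file opens that file's path under `/tree/`.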

Launching PySpark Workspaces

  1. Open the project in which you want to use a PySpark Workspace.

  2. Open the project settings, then follow the provider-specific instructions in the Hadoop and Spark overview to set up the project to work with an existing Spark connection environment. This involves enabling YARN integration in the project settings.

  3. On the Hardware & Environment tab of the project settings, choose the environment to which you added the PySpark configuration in the previous section.

  4. After you apply these settings, you can launch a PySpark Workspace from the Workspaces dashboard.

    [Screenshot: launching a PySpark Workspace from the Workspaces dashboard]
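
Once the Workspace is running, a quick check from a notebook cell can confirm that the cluster connection works. The following is a minimal sketch, assuming the environment makes `pyspark` importable in the notebook kernel and that `SparkSession.builder.getOrCreate()` picks up the cluster configuration installed in the environment. The helper names `get_session` and `cluster_smoke_test` are hypothetical and exist only for illustration.

```python
def get_session(app_name="workspace-smoke-test"):
    """Return a SparkSession, creating one if none exists yet.

    pyspark is imported lazily because it is only expected to be
    available inside the PySpark Workspace (an assumption).
    """
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def cluster_smoke_test(spark, n=1000):
    """Sum 0..n-1 through Spark and compare with the closed form n(n-1)/2."""
    total = spark.sparkContext.parallelize(range(n)).sum()
    return total == n * (n - 1) // 2
```

In a notebook cell, you might call `spark = get_session()` and then `cluster_smoke_test(spark)`. A result of `True` suggests that jobs are being submitted and executed successfully. Inspecting `spark.sparkContext.master` can additionally show which cluster manager the session is using.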