You can configure a Domino Workspace to launch a Jupyter notebook with a connection to your Spark cluster.
This allows you to operate the cluster interactively from Jupyter with PySpark.
The instructions for configuring a PySpark Workspace are below. To use them, you must have a Domino environment that meets the following prerequisites:
-
The environment must use one of the Domino Standard Environments as its base image.
-
The necessary binaries and configurations for connecting to your Spark cluster must be installed in the environment. See the provider-specific guides for setting up the environment.
-
From the Domino main menu, click Environments.
-
Click the name of an environment that meets the prerequisites listed above. It must use a Domino standard base image and already have the necessary binaries and configuration files installed for connecting to your spark cluster.
-
On the environment overview page, click Edit Definition.
-
In the Pluggable Workspace Tools field, paste the following YAML configuration.
pyspark: title: "PySpark" start: [ "/var/opt/workspaces/pyspark/start" ] iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png" httpProxy: port: 8888 rewrite: false internalPath: "/{{ supportedFileExtensions: [ ".ipynb" ]
When finished, the field should look like this.
-
Click Build to apply the changes and build a new version of the environment. Upon a successful build, the environment is ready for use.
-
Open the project you want to use a PySpark Workspace in.
-
Open the project settings, then follow the provider-specific instructions on setting up a project to work with an existing Spark connection environment. This will involve enabling YARN integration in the project settings.
-
On the Hardware & Environment tab of the project settings, choose the environment you added a PySpark configuration to in the previous section.
-
Once the above settings are applied, you can launch a PySpark Workspace from the Workspaces dashboard.