Connect to a Cloudera CDH5 cluster from Domino

Domino supports connecting to a Cloudera CDH5 cluster through the addition of cluster-specific binaries and configuration files to your Domino environment.

At a high level, the process is as follows:

  1. Connect to your CDH5 edge or gateway node and gather the required binaries and configuration files, then download them to your local machine.

  2. Upload the gathered files into a Domino project to allow access by the Domino environment builder.

  3. Create a new Domino environment that uses the uploaded files to enable connections to your cluster.

  4. Enable YARN integration for the Domino projects that you want to use with the CDH5 cluster.

Domino supports the following types of connections to a CDH5 cluster (illustrative commands for each are shown after this list):

  • FS shell

  • spark2-shell

  • spark2-submit

  • PySpark

  • YARN shell
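
For example, once an environment and project are configured as described in the rest of this topic, commands like the following can be run from a Domino workspace terminal. This is an illustrative sketch; the script name and arguments are placeholders.

    # FS shell: list the root of HDFS
    hdfs dfs -ls /

    # spark2-shell and spark2-submit against YARN
    spark2-shell --master yarn
    spark2-submit --master yarn --deploy-mode client your-job.py

    # PySpark interactive shell
    pyspark

    # YARN shell: list running applications
    yarn application -list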

Gather the required binaries and configuration files

You will find most of the files for setting up your Domino environment on your CDH5 edge or gateway node. To get started, connect to the edge node via SSH, then follow the steps below.
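
For example (the username and hostname are placeholders for your own):

    ssh user@your-edge-node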

  1. Create a directory named hadoop-binaries-configs at /tmp.

    mkdir /tmp/hadoop-binaries-configs
  2. Create the following subdirectories inside /tmp/hadoop-binaries-configs/.

    mkdir /tmp/hadoop-binaries-configs/configs
    
    mkdir /tmp/hadoop-binaries-configs/parcels
  3. Optional: If your cluster uses Kerberos authentication, create the following subdirectory in /tmp/hadoop-binaries-configs/.

    mkdir /tmp/hadoop-binaries-configs/kerberos

    Then, copy the krb5.conf Kerberos configuration file from /etc/ to /tmp/hadoop-binaries-configs/kerberos.

    cp /etc/krb5.conf /tmp/hadoop-binaries-configs/kerberos/
  4. Copy the CDH and SPARK2 directories from /opt/cloudera/parcels/ to /tmp/hadoop-binaries-configs/parcels/. These directories have a version number appended to their names, so substitute the correct version in the commands shown below.

    cp -R /opt/cloudera/parcels/CDH-<version>/ /tmp/hadoop-binaries-configs/parcels/
    cp -R /opt/cloudera/parcels/SPARK2-<version>/ /tmp/hadoop-binaries-configs/parcels/
  5. Copy the hadoop, hive, spark, and spark2 directories from /etc/ to /tmp/hadoop-binaries-configs/configs/.

    cp -R /etc/hadoop /tmp/hadoop-binaries-configs/configs/
    cp -R /etc/hive /tmp/hadoop-binaries-configs/configs/
    cp -R /etc/spark2 /tmp/hadoop-binaries-configs/configs/
    cp -R /etc/spark /tmp/hadoop-binaries-configs/configs/
  6. On the edge node, run the following command to identify the version of Java running on the cluster.

    java -version
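
    Example output for an Oracle JDK 8 installation (the version and build strings here are illustrative; yours will differ):

    java version "1.8.0_211"
    Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
    Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)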

    You should then download a JDK .tar file from the Oracle downloads page that matches that version. The filename will have a pattern like the following.

    jdk-8u211-linux-x64.tar.gz

    Keep this JDK handy on your local machine for use in a future step.

  7. Compress the /tmp/hadoop-binaries-configs/ directory to a gzip archive.

    cd /tmp
    
    tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs

    When finished, use SCP to download the archive to your local machine.
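
    For example, from your local machine (the username and hostname below are placeholders for your own):

    scp user@your-edge-node:/tmp/hadoop-binaries-configs.tar.gz .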

  8. Next, you’ll need to extract the archive on your local machine, add a java subdirectory, then add the JDK .tar file you downloaded earlier to the java subdirectory.

    tar xzf hadoop-binaries-configs.tar.gz
    
    mkdir hadoop-binaries-configs/java
    
    cp jdk-8u211-linux-x64.tar.gz hadoop-binaries-configs/java/
  9. When finished, your hadoop-binaries-configs directory should have the following structure.

    hadoop-binaries-configs/
    ├── configs/
    │   ├── hadoop/
    │   ├── hive/
    │   ├── spark/
    │   └── spark2/
    ├── java/
    │   └── jdk-8u211-linux-x64.tar.gz
    ├── parcels/
    │   ├── CDH-<version>/
    │   └── SPARK2-<version>/
    └── kerberos/  # optional
        └── krb5.conf
  10. If your directory contains all the required files, you can now compress it to a gzip archive again in preparation for uploading to Domino in the next step.

    tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs
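
    Optionally, list the archive contents to confirm the directory structure from the previous step before uploading:

    tar -tzf hadoop-binaries-configs.tar.gz | head -n 20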

Upload the binaries and configuration files to Domino

Use the following procedure to upload the archive you created in the previous step to a public Domino project. This will make the file available to the Domino environment builder.

  1. Log in to Domino, then create a new public project.


  2. Open the Files page for the new project, then click to browse for files and select the archive you created in the previous section. Then click Upload.

  3. After the archive has been uploaded, click the gear menu next to it on the Files page, right-click Download, and select Copy Link Address. Save the copied URL; you will need it in the next step.
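
    If you want to verify the link before building the environment, you can download it from any machine with network access to your Domino deployment, substituting the URL you copied for the placeholder:

    wget --no-check-certificate '<paste-your-domino-download-url-here>' -O /tmp/test-download.tar.gz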

    After you have recorded the download URL of the archive, you’re ready to build a Domino environment for connecting to your CDH5 cluster.

Create a Domino environment for connecting to CDH5

  1. Click Environments from the Domino main menu, then click Create Environment.


  2. Give the environment an informative name, then choose a base environment that includes the version of Python that is installed on the nodes of your CDH5 cluster. Most Linux distributions ship with Python 2.7 by default, so you will see the Domino Analytics Distribution for Python 2.7 used as the base image in the following examples. Click Create when finished.


  3. After creating the environment, click Edit Definition. Copy the following example into your Dockerfile Instructions, then be sure to edit it wherever necessary with values specific to your deployment and cluster.

    In this Dockerfile, wherever you see a placeholder enclosed in angle brackets, such as <paste-your-domino-download-url-here>, replace it with the corresponding value you recorded in the previous steps.

    You may also need to edit the commands that follow to match your downloaded filenames.

    USER root
    
    # Give user ubuntu ability to sudo as any user including root
    RUN echo "ubuntu ALL=(ALL:ALL) NOPASSWD: ALL" >> /etc/sudoers
    
    # Set up directories
    RUN mkdir -p /opt/cloudera/parcels && \
        mkdir /tmp/domino-hadoop-downloads && \
        mkdir /usr/java
    
    # Download the binaries and configs gzip you uploaded to Domino.
    # This downloaded gzip file should have the following
    # - CDH and Spark2 parcel directories in a 'parcels' sub-directory.
    # - java installation tar file in 'java' sub-directory
    # - krb5.conf in 'kerberos' sub-directory
    # - hadoop, hive, spark2 and spark config directories in a 'configs' sub-directory
    RUN wget --no-check-certificate <paste-your-domino-download-url-here> -O /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz && \
        tar xzf /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz -C /tmp/domino-hadoop-downloads/
    
    # Install kerberos client and update the kerberos configuration file
    RUN apt-get -y install krb5-user telnet && \
        cp /tmp/domino-hadoop-downloads/hadoop-binaries-configs/kerberos/krb5.conf /etc/krb5.conf
    
    # Install version of Java that matches hadoop cluster and update environment variables
    # Your JDK may have a different filename depending on your cluster's version of Java
    RUN tar xzf /tmp/domino-hadoop-downloads/hadoop-binaries-configs/java/jdk-8u211-linux-x64.tar.gz -C /usr/java
    ENV JAVA_HOME=/usr/java/jdk1.8.0_211
    RUN echo "export JAVA_HOME=/usr/java/jdk1.8.0_211" >> /home/ubuntu/.domino-defaults && \
        echo "export PATH=$JAVA_HOME/bin:$PATH" >> /home/ubuntu/.domino-defaults
    
    # Install CDH hadoop-client binaries from cloudera ubuntu trusty repository.
    # This example shows client binaries for CDH version 5.15 here.
    # Update these commands with the CDH version that matches your cluster.
    RUN echo "deb [arch=amd64] http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh trusty-cdh5.15.0 contrib" >> /etc/apt/sources.list.d/cloudera.list && \
        echo "deb-src http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh trusty-cdh5.15.0 contrib" >> /etc/apt/sources.list.d/cloudera.list && \
        wget http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/archive.key -O /tmp/domino-hadoop-downloads/archive.key && \
        apt-key add /tmp/domino-hadoop-downloads/archive.key && \
        apt-get update && \
        apt-get -y -t trusty-cdh5.15.0 install zookeeper && \
        apt-get -y -t trusty-cdh5.15.0 install hadoop-client
    
    # Copy CDH and Spark2 parcels to correct directories and update symlinks
    # Note that the version strings attached to your directory names may be different than the below examples.
    RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21 /opt/cloudera/parcels/ && \
        mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809 /opt/cloudera/parcels/ && \
        ln -s /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21 /opt/cloudera/parcels/CDH && \
        ln -s /opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809 /opt/cloudera/parcels/SPARK2
    
    # Copy hadoop, hive and spark2 configurations
    RUN mv /etc/hadoop /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hadoop-etc-local.backup && \
        mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hadoop /etc/hadoop && \
        mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hive /etc/hive && \
        mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/spark2 /etc/spark2 && \
        mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/spark /etc/spark
    
    # Create alternatives for hadoop configurations. Make sure each configuration path below matches
    # the corresponding directory name on your edge node.
    # Example: in the command 'update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cloudera.yarn 55'
    # make sure that /etc/hadoop/conf.cloudera.yarn is named the same as the corresponding directory on your edge node.
    # On some CDH5 edge nodes it is named something like /etc/hadoop/conf.cloudera.yarn_
    RUN update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cloudera.yarn 55 && \
        update-alternatives --install /etc/hive/conf hive-conf /etc/hive/conf.cloudera.hive 55 && \
        update-alternatives --install /etc/spark2/conf spark2-conf /etc/spark2/conf.cloudera.spark2_on_yarn 55 && \
        update-alternatives --install /etc/spark/conf spark-conf /etc/spark/conf.cloudera.spark_on_yarn 55
    
    # These instructions are for Spark2
    # Creating alternatives for Spark2 binaries, also create symlink for pyspark pointing to pyspark2
    RUN update-alternatives --install /usr/bin/spark2-shell spark2-shell /opt/cloudera/parcels/SPARK2/bin/spark2-shell 55 && \
        update-alternatives --install /usr/bin/spark2-submit spark2-submit /opt/cloudera/parcels/SPARK2/bin/spark2-submit 55 && \
        update-alternatives --install /usr/bin/pyspark2 pyspark2 /opt/cloudera/parcels/SPARK2/bin/pyspark2 55 && \
        ln -s /usr/bin/pyspark2 /usr/bin/pyspark
    
    # Update SPARK and HADOOP environment variables. Make sure the py4j filename is correct for your edge node
    ENV SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
    RUN echo "export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop" >> /home/ubuntu/.domino-defaults && \
        echo "export HADOOP_CONF_DIR=/etc/hadoop/conf" >> /home/ubuntu/.domino-defaults && \
        echo "export YARN_CONF_DIR=/etc/hadoop/conf" >> /home/ubuntu/.domino-defaults && \
        echo "export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2" >> /home/ubuntu/.domino-defaults && \
        echo "export SPARK_CONF_DIR=/etc/spark2/conf" >> /home/ubuntu/.domino-defaults && \
        echo "export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip" >> /home/ubuntu/.domino-defaults
    
    # Move spark-defaults.conf out of the conf directory and make the conf directory writable
    RUN mv /etc/spark2/conf/spark-defaults.conf /etc/spark2/ && \
        chmod 777 /etc/spark2/conf.cloudera.spark2_on_yarn
    
    # Copy hive-site.xml to /etc/spark2/conf to access hive tables from Spark2.
    RUN cp /etc/spark2/conf/yarn-conf/hive-site.xml /etc/spark2/conf/
    
    USER ubuntu
  4. Scroll down to the Pre Run Script field and add the following lines. At startup, these append the relocated spark-defaults.conf back into the active configuration directory and remove any spark.ui.port=0 setting.

    cat /etc/spark2/spark-defaults.conf >> /etc/spark2/conf/spark-defaults.conf
    sed -i.bak '/spark.ui.port\=0/d' /etc/spark2/conf/spark-defaults.conf
  5. Scroll down and click Advanced to expand additional fields. Add the following line to the Post Setup Script field.

    echo "export YARN_CONF_DIR=/etc/hadoop/conf" >> /home/ubuntu/.bashrc
  6. Click Build when finished editing the Dockerfile instructions. If the build completes successfully, you are ready to try using the environment.

Configure a Domino project for use with a CDH5 cluster

This procedure assumes that an environment with the necessary client software has been created according to the instructions above. Ask your Domino admin for access to such an environment.

  1. Open the Domino project you want to use with your CDH5 cluster, then click Settings from the project menu.

  2. On the Integrations tab, click to select YARN integration from the Apache Spark panel, then click Save. You do not need to edit any of the fields in this section.

  3. If your cluster uses Kerberos authentication, configure credentials at the user level or project level before attempting to use the environment. Note that if you followed the instructions above when creating your environment, your Kerberos configuration file has already been added to it.
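
    To confirm that authentication works, you can check for a valid ticket from a workspace terminal with klist, or obtain one manually with kinit. The keytab path and principal below are placeholders.

    klist
    kinit -kt /path/to/your.keytab your-principal@YOUR.REALM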

  4. On the Hardware & Environment tab, change the project default environment to the one with the cluster’s binaries and configuration files installed.

You are now ready to start Runs from this project that interact with your CDH5 cluster.
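
As a quick smoke test, open a workspace in the project and try a few commands against the cluster. The commands below are a sketch: the HDFS path is illustrative, and the pi.py example assumes your SPARK2 parcel ships the standard Spark examples.

    # Verify HDFS connectivity
    hdfs dfs -ls /

    # Verify Spark on YARN with the bundled pi example (if present)
    spark2-submit --master yarn --deploy-mode client \
        /opt/cloudera/parcels/SPARK2/lib/spark2/examples/src/main/python/pi.py 10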
