Connect to an Amazon EMR cluster from Domino

Domino supports connecting to an Amazon EMR cluster by adding cluster-specific binaries and configuration files to your Domino environment.

At a high level, the process is as follows:

  1. Connect to the EMR Master Node and gather the binaries and configuration files that Domino needs.

  2. Create a new Domino environment that downloads those files to enable connections to your cluster.

  3. Enable YARN integration for the Domino projects that you want to use with the EMR cluster.

Domino supports the following types of connections to an EMR cluster:

  • FS shell

  • spark-shell

  • spark-submit

  • pyspark
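
For orientation, these are typical invocations for each connection type, run from a Domino workspace after the setup below is complete (a sketch; app.py stands in for your own script):

hadoop fs -ls /                      # FS shell
spark-shell --master yarn            # interactive Scala shell on YARN
spark-submit --master yarn app.py    # submit a batch job to YARN
pyspark --master yarn                # interactive Python shell on YARN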

Requirements

These instructions assume the following:

  • Domino needs to be routable from the EMR cluster by private EC2 IP. This can be achieved by launching EMR directly into Domino’s VPC or via VPC Peering.

  • Your security groups are configured to allow traffic between EMR and Domino. The Domino node security group, the EMR Master Node, and the EMR Worker Node security groups all need to allow TCP traffic between them.
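
If you manage security groups with the AWS CLI, a rule like the one sketched below allows all TCP traffic from one group into another (the group IDs are placeholders; repeat the command for each pairing of the Domino node, EMR Master Node, and EMR Worker Node groups, in both directions):

aws ec2 authorize-security-group-ingress \
    --group-id <DOMINO_NODE_SG_ID> \
    --protocol tcp \
    --port 0-65535 \
    --source-group <EMR_MASTER_SG_ID>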

Gather and serve the required binaries and configuration files

You will find the necessary files for setting up your Domino environment on the EMR Master Node. To get started, connect to your Master Node via SSH.

After connecting to the Master Node, use vi or the editor of your choice to create a script called domino-emr-config-maker.sh. Copy in the following code and save the script.

#!/bin/bash

# Clean up artifacts from any previous run of this script.
rm -rf www
rm -rf /tmp/hadoop-binaries-configs

mkdir -p www
mkdir -p /tmp/hadoop-binaries-configs/configs

# Copy the Hadoop, Hive, and Spark configuration directories,
# dereferencing symlinks.
cp -rL /etc/hadoop /tmp/hadoop-binaries-configs/configs
cp -rL /etc/hive /tmp/hadoop-binaries-configs/configs
cp -rL /etc/spark /tmp/hadoop-binaries-configs/configs

# Copy the Hadoop, Spark, AWS, and Java binaries.
cp -r /usr/lib/hadoop /tmp/hadoop-binaries-configs
cp -r /usr/lib/hadoop-lzo /tmp/hadoop-binaries-configs
cp -r /usr/lib/spark /tmp/hadoop-binaries-configs
cp -r /usr/share/aws /tmp/hadoop-binaries-configs
cp -r /usr/share/java /tmp/hadoop-binaries-configs

# Rewrite hdfs-site.xml so clients address DataNodes by hostname:
# drop the closing </configuration> tag, append the property, then
# restore the closing tag.
cd /tmp/hadoop-binaries-configs/configs/hadoop/conf/
sed -i '$ d' hdfs-site.xml
echo "<property>" >> hdfs-site.xml
echo "<name>dfs.client.use.datanode.hostname</name>" >> hdfs-site.xml
echo "<value>true</value>" >> hdfs-site.xml
echo "</property>" >> hdfs-site.xml
echo "</configuration>" >> hdfs-site.xml

# Bundle everything into a single archive and serve it over HTTP
# (python3's http.server listens on port 8000 by default).
cd /tmp
tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs
cd ~
mv /tmp/hadoop-binaries-configs.tar.gz www/
cd www
/usr/bin/python3 -m http.server

This script bundles all of the required binaries and configuration files into a single archive and serves it via a webserver on port 8000 of the Master Node. If you have not already, open port 8000 in your cluster's security group.

Before moving on, note the private IP address of your EMR Master Node; it is shown in your SSH session's prompt.

Execute the script with the command bash domino-emr-config-maker.sh to begin the bundling and launch the webserver. Leave your SSH connection to the Master Node open while you finish the rest of this setup.
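
To confirm the archive is reachable from Domino, you can fetch its headers from a workspace or any host in Domino's network (a sketch; replace <MASTER_NODE_PRIVATE_IP> with the address you noted). An HTTP 200 response confirms that the webserver is running and the security group rules allow the traffic.

curl -sI http://<MASTER_NODE_PRIVATE_IP>:8000/hadoop-binaries-configs.tar.gz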

Create a Domino environment for connecting to EMR

  1. Create a new Domino environment with the latest version of the Domino Analytics Distribution as its base image.

  2. Edit this environment, and add the following code to the environment’s Dockerfile. Be sure to replace <MASTER_NODE_PRIVATE_IP> with the private IP address you noted earlier.

    ENV EMR_MASTER_PRIVATE_IP <MASTER_NODE_PRIVATE_IP>
    
    USER root
    RUN echo "ubuntu ALL=(ALL:ALL) NOPASSWD: ALL" >> /etc/sudoers
    
    RUN mkdir /tmp/domino-hadoop-downloads
    
    # Download the binaries and configs gzip from EMR master.
    #
    # This downloaded gzip archive should contain a configs directory with
    # hadoop, hive, and spark subdirectories.
    #
    # You may need to edit this depending on where you are running the web server on your EMR master.
    RUN wget -q http://$EMR_MASTER_PRIVATE_IP:8000/hadoop-binaries-configs.tar.gz -O /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz && \
        tar xzf /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz -C /tmp/domino-hadoop-downloads/
    
    RUN cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hadoop /etc/hadoop && \
        cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hive /etc/hive && \
        cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/spark /etc/spark
    
    RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/aws /usr/share/aws
    RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/hadoop /usr/lib/hadoop
    RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/hadoop-lzo /usr/lib/hadoop-lzo
    RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/spark /usr/lib/spark
    RUN cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/java/* /usr/share/java/
    
    RUN \
    echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /home/ubuntu/.domino-defaults && \
    echo 'export HADOOP_HOME=/usr/lib/hadoop' >> /home/ubuntu/.domino-defaults && \
    echo 'export SPARK_HOME=/usr/lib/spark' >> /home/ubuntu/.domino-defaults && \
    echo 'export PYTHONPATH=${PYTHONPATH:-}:${SPARK_HOME:-}/python/' >> /home/ubuntu/.domino-defaults && \
    echo 'export PYTHONPATH=${PYTHONPATH:-}:${SPARK_HOME:-}/python/lib/py4j-0.10.7-src.zip' >> /home/ubuntu/.domino-defaults && \
    echo 'export PATH=${PATH:-}:${SPARK_HOME:-}/bin' >> /home/ubuntu/.domino-defaults && \
    echo 'export PATH=${PATH:-}:${HADOOP_HOME:-}/bin' >> /home/ubuntu/.domino-defaults && \
    echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> /home/ubuntu/.domino-defaults && \
    echo 'export YARN_CONF_DIR=/etc/hadoop/conf' >> /home/ubuntu/.domino-defaults && \
    echo 'export SPARK_CONF_DIR=/etc/spark/conf' >> /home/ubuntu/.domino-defaults
    RUN mkdir -p /var/log/spark/user/ubuntu
    RUN chown ubuntu:ubuntu /var/log/spark/user/ubuntu
    RUN chmod -R 777 /usr/lib/spark/conf
    
    USER ubuntu
  3. Build the new revision of the environment by clicking the button at the bottom of the page. You can follow along by viewing the Build Logs, accessible from the Revisions table of the environment.

  4. After the environment builds successfully, you can stop the webserver on your EMR Master Node and close the SSH connection.
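
To sanity-check the new environment, you can start a workspace that uses it and confirm the binaries and variables are in place (a sketch):

echo $SPARK_HOME $HADOOP_HOME    # expect /usr/lib/spark /usr/lib/hadoop
hadoop version                   # Hadoop build copied from the EMR master
spark-submit --version           # Spark build copied from the EMR master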

Configure a Domino project for use with an EMR cluster

This procedure assumes that an environment with the necessary client software has been created according to the instructions above. Ask your Domino admin for access to such an environment. Note that you may need to provide Domino with additional options when setting up your project; your Domino or AWS administrators can provide the correct values for these options.

  1. Open the Domino project you want to use with your EMR cluster, then click Settings from the project menu.

  2. On the Integrations tab, select YARN integration in the Apache Spark panel.

  3. Use root as the Hadoop username.

  4. If your work with the cluster generates many warnings about missing Java packages, you can suppress them by adding the following to Spark Configuration Options (an equivalent command-line form is sketched after this list).

    Key: spark.hadoop.yarn.timeline-service.enabled

    Value: false

  5. After entering your YARN configuration, click Save.

  6. On the Hardware & Environment tab, change the project default environment to the one you built earlier with the binaries and configuration files.
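
The same option can also be passed on the command line when submitting a job, if you prefer not to set it project-wide (a sketch; app.py is a placeholder for your own script):

spark-submit --master yarn \
    --conf spark.hadoop.yarn.timeline-service.enabled=false \
    app.py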

You are now ready to start Runs from this project that interact with your EMR cluster.
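
As a quick smoke test, you can list HDFS and submit the SparkPi example that ships with the Spark binaries copied from EMR (a sketch; the examples jar path can vary by Spark version):

hadoop fs -ls /
spark-submit --master yarn --deploy-mode client \
    --class org.apache.spark.examples.SparkPi \
    /usr/lib/spark/examples/jars/spark-examples.jar 100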
