Domino supports connecting to an Amazon EMR cluster through the addition of cluster-specific binaries and configuration files to your Domino environment.
At a high level, the process is as follows:
- Connect to the EMR Master Node, gather the required binaries and configuration files, and download them to your local machine.
- Upload the gathered files into a Domino project to allow access by the Domino environment builder.
- Create a new Domino environment that uses the uploaded files to enable connections to your cluster.
- Enable YARN integration for the Domino projects that you want to use with the EMR cluster.
Domino supports connections to an EMR cluster for Spark on YARN, HDFS, and Hive workloads.
You will find the necessary files for setting up your Domino environment on the EMR Master Node. To get started, connect to your Master Node via SSH, then follow the steps below.
- Create a directory named `hadoop-binaries-configs` at `/tmp`.

  ```
  mkdir /tmp/hadoop-binaries-configs
  ```
- Create a subdirectory named `configs` in `/tmp/hadoop-binaries-configs`, then create `hadoop`, `hive`, and `spark` subdirectories inside it.

  ```
  mkdir /tmp/hadoop-binaries-configs/configs
  mkdir -p /tmp/hadoop-binaries-configs/configs/hadoop
  mkdir -p /tmp/hadoop-binaries-configs/configs/hive
  mkdir -p /tmp/hadoop-binaries-configs/configs/spark
  ```
- Copy the contents of the `hive`, `spark`, and `hadoop` directories from `/etc` to the matching subdirectories in `/tmp/hadoop-binaries-configs/configs`.

  ```
  cp -rL /etc/hadoop/conf /tmp/hadoop-binaries-configs/configs/hadoop
  cp -rL /etc/hive/conf /tmp/hadoop-binaries-configs/configs/hive
  cp -rL /etc/spark/conf /tmp/hadoop-binaries-configs/configs/spark
  ```
- Copy the following additional directories to `/tmp/hadoop-binaries-configs`.

  ```
  cp -r /usr/lib/hadoop /tmp/hadoop-binaries-configs
  cp -r /usr/lib/hadoop-lzo /tmp/hadoop-binaries-configs
  cp -r /usr/lib/spark /tmp/hadoop-binaries-configs
  cp -r /usr/share/aws /tmp/hadoop-binaries-configs
  cp -r /usr/share/java /tmp/hadoop-binaries-configs
  ```
- Add the following lines to the end of the newly copied file at `/tmp/hadoop-binaries-configs/configs/hadoop/conf/hdfs-site.xml`. This is necessary because the Domino executor will not be able to connect to HDFS on the same private IPs that the Master Node uses.

  ```
  cd /tmp/hadoop-binaries-configs/configs/hadoop/conf/
  echo "<property>" >> hdfs-site.xml
  echo "<name>dfs.client.use.datanode.hostname</name>" >> hdfs-site.xml
  echo "<value>true</value>" >> hdfs-site.xml
  echo "</property>" >> hdfs-site.xml
  ```
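  The appended block should read as shown below. Note that Hadoop only reads properties that sit inside the `<configuration>` element, so if the echoed lines land after the closing `</configuration>` tag, move the property block inside it.

  ```
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
  ```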
- (Optional) If your EMR cluster uses Kerberos authentication, create a subdirectory named `kerberos` at `/tmp/hadoop-binaries-configs`.

  ```
  mkdir /tmp/hadoop-binaries-configs/kerberos
  ```

  Then copy the Kerberos configuration file `krb5.conf` from `/etc` to `/tmp/hadoop-binaries-configs/kerberos`.

  ```
  cp /etc/krb5.conf /tmp/hadoop-binaries-configs/kerberos/
  ```
- Once you have copied and edited all of the above files into `/tmp/hadoop-binaries-configs`, create a gzipped tar archive of the directory for transfer to your local machine.

  ```
  cd /tmp
  tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs
  ```

  Then use SCP from your local machine to download the archive. Refer back to the AWS documentation on connecting to a Master Node via SSH for credentialing and address information.
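  For example, a command like the following, run from your local machine, downloads the archive. The key path and hostname shown are placeholders; substitute the SSH key and Master Node public DNS name for your cluster (EMR clusters typically use the `hadoop` user).

  ```
  scp -i ~/path/to/your-emr-key.pem \
      hadoop@your-master-node-public-dns:/tmp/hadoop-binaries-configs.tar.gz .
  ```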
Use the following procedure to upload the files you retrieved in the previous step to a public Domino project. This will make the files available to the Domino environment builder.
- Log in to Domino, then create a new public project.
- Open the Files page for the new project, click to browse for files, and select the archive of binaries and configuration files you downloaded from the EMR Master Node. Then click Upload.
- After your upload has completed, click the gear menu next to the uploaded file, right-click Download, and click Copy Link Address. Save this URL in your notes, as you will need it in a later step.
Once you have recorded the download URL of the binaries and configuration files archive, you’re ready to build a Domino environment for connecting to EMR.
- First, visit the Spark downloads page to copy a download URL for the Spark binaries. Use the dropdown menus to select the correct version of the binaries for your EMR cluster, then right-click the download link and click Copy Link Address. Record the copied URL for use in a later step.
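  As an illustration, the Apache archive URL for Spark 2.4.4 prebuilt for Hadoop 2.7 is https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz; the version you need depends on the Spark and Hadoop versions your EMR release ships, so confirm it against your cluster.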
- Click Environments from the Domino main menu, then click Create Environment.
- Give the environment an informative name, then choose a base environment that includes the version of Python installed on the nodes of your EMR cluster. Most Linux distributions ship with Python 2.7 by default, so the following examples use the Domino Analytics Distribution for Python 2.7 as the base image. Click Create when finished.
- After creating the environment, click Edit Definition. Copy the example below into your Dockerfile Instructions, then edit it wherever necessary with values specific to your deployment and cluster.

  In this Dockerfile, wherever you see a hyphenated placeholder enclosed in angle brackets, such as `<paste-your-domino-file-download-url-here>`, replace it with the corresponding value you recorded in previous steps. You may also need to edit the commands that follow to match the downloaded filenames.

  ```
  USER root

  # Give the ubuntu user the ability to sudo as any user, including root
  RUN echo "ubuntu ALL=(ALL:ALL) NOPASSWD: ALL" >> /etc/sudoers

  # Set up directories
  RUN mkdir /tmp/domino-hadoop-downloads

  # Download the binaries-and-configs gzip archive from the Domino project.
  #
  # This archive should contain a configs directory with hadoop, hive,
  # and spark subdirectories.
  #
  # Make sure the URL is edited to reflect where you uploaded your configs.
  # You should have this saved from previous steps.
  RUN wget --no-check-certificate <paste-your-domino-file-download-url-here> -O /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz && \
      tar xzf /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz -C /tmp/domino-hadoop-downloads/

  ### Copy hadoop, hive, and spark configurations
  RUN cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hadoop /etc/hadoop && \
      cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hive /etc/hive && \
      cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/spark /etc/spark

  ### Set the correct hadoop, spark, and hive config directory names.
  ### On some EMR clusters these directories are named conf.dist rather than
  ### conf.empty. Check your cluster for the right name for each config.
  RUN rm /etc/hadoop/conf && rm /etc/spark/conf && rm /etc/hive/conf && \
      mv /etc/hadoop/conf.empty /etc/hadoop/conf && \
      mv /etc/spark/conf.empty /etc/spark/conf && \
      mv /etc/hive/conf.empty /etc/hive/conf

  ### Copy EMR jars to the right locations
  RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/aws /usr/share/aws
  RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/hadoop /usr/lib/hadoop
  RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/hadoop-lzo /usr/lib/hadoop-lzo
  RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/spark /usr/lib/spark
  RUN cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/java/* /usr/share/java/

  ### Update SPARK and HADOOP environment variables.
  ### Make sure the py4j filename matches the version shipped on your cluster.
  RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle/' >> /home/ubuntu/.domino-defaults && \
      echo 'export HADOOP_HOME=/usr/lib/hadoop' >> /home/ubuntu/.domino-defaults && \
      echo 'export SPARK_HOME=/usr/lib/spark' >> /home/ubuntu/.domino-defaults && \
      echo 'export PYTHONPATH=${PYTHONPATH:-}:${SPARK_HOME:-}/python/' >> /home/ubuntu/.domino-defaults && \
      echo 'export PYTHONPATH=${PYTHONPATH:-}:${SPARK_HOME:-}/python/lib/py4j-0.10.7-src.zip' >> /home/ubuntu/.domino-defaults && \
      echo 'export PATH=${PATH:-}:${SPARK_HOME:-}/bin' >> /home/ubuntu/.domino-defaults && \
      echo 'export PATH=${PATH:-}:${HADOOP_HOME:-}/bin' >> /home/ubuntu/.domino-defaults && \
      echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> /home/ubuntu/.domino-defaults && \
      echo 'export YARN_CONF_DIR=/etc/hadoop/conf' >> /home/ubuntu/.domino-defaults && \
      echo 'export SPARK_CONF_DIR=/etc/spark/conf' >> /home/ubuntu/.domino-defaults
  ```
- Click Build when finished editing the Dockerfile Instructions. If the build completes successfully, you are ready to try using the environment.
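  As a quick check, the sketch below shows commands you could run from a workspace or job that uses the new environment. It assumes the paths and environment variables set by the Dockerfile above, and the HDFS listing only succeeds once the cluster is reachable from Domino.

  ```
  # Confirm that the Hadoop client binaries and configs are on the path
  hadoop version
  # List the root of the cluster's HDFS (requires network access to the cluster)
  hdfs dfs -ls /
  ```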
This procedure assumes that an environment with the necessary client software has been created according to the instructions above. Ask your Domino admin for access to such an environment. Note that you may need to provide Domino with additional options when setting up your project. Your Domino or AWS administrators should be able to provide you with the correct values for these options.
- Open the Domino project you want to use with your EMR cluster, then click Settings from the project menu.
- On the Integrations tab, select YARN integration from the Apache Spark panel.
- Use `root` as the Hadoop username.
- If your EMR cluster is in the same AWS VPC as your Domino deployment, you do not need to list the hosts in the Custom /etc/hosts entries field. If your Domino deployment is in a separate network from the EMR cluster, list the hostnames of the nodes in your cluster as standard hosts-file entries (IP address followed by hostname).
If your work with the cluster generates many warnings about missing Java packages, you can suppress these by adding the following to Spark Configuration Options.

Key: `spark.hadoop.yarn.timeline-service.enabled`
Value: `false`
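For reference, this corresponds to passing `--conf spark.hadoop.yarn.timeline-service.enabled=false` on a manual spark-submit invocation; Domino applies the option for you once it is saved in the project settings.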
- After inputting your YARN configuration, click Save.
- On the Hardware & Environment tab, change the project default environment to the one you built earlier with the binaries and configuration files.
You are now ready to start Runs from this project that interact with your EMR cluster.
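As a simple smoke test, you could submit the Pi example that ships with the Spark distribution from a workspace or job in this project. This sketch assumes the `SPARK_HOME`, `HADOOP_CONF_DIR`, and `YARN_CONF_DIR` variables set by the environment built above; the example script path is standard for Apache Spark distributions, but verify that it exists in your build.

```
# Submit a trivial PySpark job to the EMR cluster through YARN
spark-submit --master yarn --deploy-mode client \
    $SPARK_HOME/examples/src/main/python/pi.py 10
```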