Domino supports connecting to an Amazon EMR cluster through the addition of cluster-specific binaries and configuration files to your Domino environment.
At a high level, the process is as follows:
- Connect to the EMR Master Node and gather the binaries and configuration files that Domino requires.
- Create a new Domino environment that uses the uploaded files to enable connections to your cluster.
- Enable YARN integration for the Domino projects that you want to use with the EMR cluster.
Domino supports HDFS, Hive, and Spark (via YARN) connections to an EMR cluster.
These instructions assume the following:
- Domino is routable from the EMR cluster by private EC2 IP address. You can achieve this by launching EMR directly into Domino's VPC or by using VPC peering.
- Your security groups allow traffic between EMR and Domino. The Domino node security group, the EMR Master Node security group, and the EMR Worker Node security groups all need to allow TCP traffic between one another. One way to add such rules with the AWS CLI is sketched after this list.
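The following is a minimal sketch of adding one of these rules with the AWS CLI. The group IDs sg-DOMINO-NODES and sg-EMR-MASTER are placeholders for your own security group IDs; repeat the command for each pair of groups that must communicate, and narrow the port range if you want to be more restrictive.

# Allow all TCP traffic from the Domino node security group into the
# EMR Master Node security group. sg-EMR-MASTER and sg-DOMINO-NODES
# are hypothetical IDs -- substitute your own.
aws ec2 authorize-security-group-ingress \
    --group-id sg-EMR-MASTER \
    --protocol tcp \
    --port 0-65535 \
    --source-group sg-DOMINO-NODES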
You will find the necessary files for setting up your Domino environment on the EMR Master Node. To get started, connect to your Master Node via SSH.
After you are connected to the Master Node, use vi (or the editor of your choice) to create a script called domino-emr-config-maker.sh. Copy in the following code and save the script.
Note: This script also captures Hive binaries. If you do not have Hive installed on your EMR cluster, you may need to remove the Hive-related commands from this script.
#!/bin/bash

# Clean up artifacts from any previous run of this script.
rm -rf ~/www
rm -rf /tmp/hadoop-binaries-configs

mkdir -p ~/www
mkdir -p /tmp/hadoop-binaries-configs/configs

# Copy the Hadoop, Hive, and Spark configuration directories,
# dereferencing symlinks (-L). Remove the hive line if Hive is not
# installed on your cluster.
cp -rL /etc/hadoop /tmp/hadoop-binaries-configs/configs
cp -rL /etc/hive /tmp/hadoop-binaries-configs/configs
cp -rL /etc/spark /tmp/hadoop-binaries-configs/configs

# Copy the binaries and supporting libraries.
cp -r /usr/lib/hadoop /tmp/hadoop-binaries-configs
cp -r /usr/lib/hadoop-lzo /tmp/hadoop-binaries-configs
cp -r /usr/lib/spark /tmp/hadoop-binaries-configs
cp -r /usr/share/aws /tmp/hadoop-binaries-configs
cp -r /usr/share/java /tmp/hadoop-binaries-configs

# Add the dfs.client.use.datanode.hostname property to hdfs-site.xml:
# delete the closing </configuration> tag, append the property, then
# close the file again.
cd /tmp/hadoop-binaries-configs/configs/hadoop/conf/
sed -i '$ d' hdfs-site.xml
echo "<property>" >> hdfs-site.xml
echo "<name>dfs.client.use.datanode.hostname</name>" >> hdfs-site.xml
echo "<value>true</value>" >> hdfs-site.xml
echo "</property>" >> hdfs-site.xml
echo "</configuration>" >> hdfs-site.xml

# Bundle everything into a single archive and serve it over HTTP.
# python3's http.server listens on port 8000 by default.
cd /tmp
tar -zcf hadoop-binaries-configs.tar.gz hadoop-binaries-configs
mv /tmp/hadoop-binaries-configs.tar.gz ~/www/
cd ~/www
/usr/bin/python3 -m http.server
This script bundles all of the required binaries and configuration files into a single archive and serves it via a web server on port 8000 of the Master Node. If you have not already opened port 8000 in your cluster's security group, do so now (the security group sketch above shows the CLI pattern).
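For reference, the sed and echo sequence in the script leaves the end of hdfs-site.xml looking like the snippet below. The dfs.client.use.datanode.hostname property makes HDFS clients outside the cluster address data nodes by hostname rather than by their cluster-internal IPs.

<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
</configuration>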
Before moving on, note the private IP address of your EMR Master Node. It appears in the shell prompt of your SSH session.
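If the prompt does not show it, one way to look it up from the Master Node itself is the EC2 instance metadata service, as sketched below (this assumes IMDSv1-style access, which EMR instances typically permit):

# Print this instance's private IPv4 address.
curl -s http://169.254.169.254/latest/meta-data/local-ipv4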
Execute this script via the command bash domino-emr-config-maker.sh to begin the bundling and launch the web server. Leave your SSH connection to the Master Node open while finishing the rest of this setup.
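Before building anything in Domino, you can optionally confirm that the archive is reachable from Domino's side of the network. A minimal check, run from a Domino workspace or any host in Domino's VPC, assuming <MASTER_NODE_PRIVATE_IP> is the address you noted:

# A HEAD request is enough to confirm routing and security group rules;
# python3's http.server answers HEAD requests.
curl -sI http://<MASTER_NODE_PRIVATE_IP>:8000/hadoop-binaries-configs.tar.gz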
- Create a new Domino environment with the latest version of the Domino Analytics Distribution as its base image.
- Edit this environment and add the following code to the environment's Dockerfile. Be sure to replace <MASTER_NODE_PRIVATE_IP> with the private IP address you noted earlier.

ENV EMR_MASTER_PRIVATE_IP <MASTER_NODE_PRIVATE_IP>

USER root

RUN echo "ubuntu ALL=(ALL:ALL) NOPASSWD: ALL" >> /etc/sudoers

RUN mkdir /tmp/domino-hadoop-downloads

# Download the binaries and configs gzip from the EMR master.
#
# The downloaded gzip archive should contain a configs directory with
# hadoop, hive, and spark subdirectories.
#
# You may need to edit this depending on where you are running the
# web server on your EMR master.
RUN wget -q http://$EMR_MASTER_PRIVATE_IP:8000/hadoop-binaries-configs.tar.gz -O /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz && \
    tar xzf /tmp/domino-hadoop-downloads/hadoop-binaries-configs.tar.gz -C /tmp/domino-hadoop-downloads/

RUN cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hadoop /etc/hadoop && \
    cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/hive /etc/hive && \
    cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/configs/spark /etc/spark

RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/aws /usr/share/aws
RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/hadoop /usr/lib/hadoop
RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/hadoop-lzo /usr/lib/hadoop-lzo
RUN mv /tmp/domino-hadoop-downloads/hadoop-binaries-configs/spark /usr/lib/spark
RUN cp -r /tmp/domino-hadoop-downloads/hadoop-binaries-configs/java/* /usr/share/java/

# Point Domino runs at the cluster software by exporting the standard
# Hadoop and Spark environment variables.
RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /home/ubuntu/.domino-defaults && \
    echo 'export HADOOP_HOME=/usr/lib/hadoop' >> /home/ubuntu/.domino-defaults && \
    echo 'export SPARK_HOME=/usr/lib/spark' >> /home/ubuntu/.domino-defaults && \
    echo 'export PYTHONPATH=${PYTHONPATH:-}:${SPARK_HOME:-}/python/' >> /home/ubuntu/.domino-defaults && \
    echo 'export PYTHONPATH=${PYTHONPATH:-}:${SPARK_HOME:-}/python/lib/py4j-0.10.7-src.zip' >> /home/ubuntu/.domino-defaults && \
    echo 'export PATH=${PATH:-}:${SPARK_HOME:-}/bin' >> /home/ubuntu/.domino-defaults && \
    echo 'export PATH=${PATH:-}:${HADOOP_HOME:-}/bin' >> /home/ubuntu/.domino-defaults && \
    echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> /home/ubuntu/.domino-defaults && \
    echo 'export YARN_CONF_DIR=/etc/hadoop/conf' >> /home/ubuntu/.domino-defaults && \
    echo 'export SPARK_CONF_DIR=/etc/spark/conf' >> /home/ubuntu/.domino-defaults

RUN mkdir -p /var/log/spark/user/ubuntu
RUN chown ubuntu:ubuntu /var/log/spark/user/ubuntu
RUN chmod -R 777 /usr/lib/spark/conf

USER ubuntu
- Build the new revision of the environment by clicking the button at the bottom of the page. You may want to follow along by viewing the Build Logs, accessible from the Revisions table of the environment.
Note: If the build hangs or fails, you may need to adjust the inbound rules of your security groups as described at the start of this doc.
- After the environment builds successfully, you can stop the web server on your EMR Master Node and close the SSH connection. A quick smoke test of the new environment is sketched after this list.
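To sanity-check the integration, start a workspace or run that uses the new environment and try a couple of read-only cluster commands. This is a sketch; it assumes the exports written to /home/ubuntu/.domino-defaults above are active in the run and that the cluster is reachable:

# List the root of the cluster's HDFS filesystem.
hadoop fs -ls /
# List the cluster's YARN worker nodes.
yarn node -list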
This procedure assumes that an environment with the necessary client software has been created according to the instructions above; ask your Domino administrator for access to such an environment. You may need to provide additional options when setting up your project, and your Domino or AWS administrators should be able to supply the correct values.
- Open the Domino project you want to use with your EMR cluster, then click Settings in the project menu.
- On the Integrations tab, select YARN integration in the Apache Spark panel.
- Use root as the Hadoop username.
- If your work with the cluster generates many warnings about missing Java packages, you can suppress them by adding the following to Spark Configuration Options (the spark-submit sketch at the end of this doc shows the equivalent command-line flag):
Key: spark.hadoop.yarn.timeline-service.enabled
Value: false
- After entering your YARN configuration, click Save.
- On the Hardware & Environment tab, change the project's default environment to the one you built earlier with the binaries and configuration files.
You are now ready to start Runs from this project that interact with your EMR cluster.
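As an end-to-end check, you can submit Spark's bundled SparkPi example to the cluster from a run or workspace in this project. This is a sketch: the examples jar path assumes the Spark distribution copied from EMR ships its examples under $SPARK_HOME/examples/jars, and the --conf flag mirrors the timeline-service option described above.

# Submit the SparkPi example to YARN in client mode.
spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    --conf spark.hadoop.yarn.timeline-service.enabled=false \
    "$SPARK_HOME"/examples/jars/spark-examples_*.jar 100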