Connecting to S3 from Domino




Overview

This article describes how to connect to Amazon Simple Storage Service (S3) from Domino.

S3 is a cloud object store available as a service from AWS.




Credential Setup

There are two main ways to authenticate with S3 from Domino. Both methods follow the standard AWS environment variable naming conventions, so you do not need to reference credentials explicitly in your code. A short sketch for checking which variables are set in your environment appears after this list.

  1. Using a short-lived credential file obtained via Domino’s AWS Credential Propagation feature.

Once this feature has been configured by your administrator, Domino will automatically populate any run or job with your AWS credentials file. These credentials are refreshed periodically throughout the lifetime of the workspace or job so that they do not expire.

Following common AWS conventions, Domino sets the environment variable AWS_SHARED_CREDENTIALS_FILE, which contains the location of your credentials file. The file itself is placed at /var/lib/domino/home/.aws/credentials.

Learn more about using a credential file with the AWS SDK.

  2. Storing your AWS access keys securely as environment variables.

In order to connect to the S3 buckets your AWS account has access to, you’ll need to provide your AWS access key ID and secret access key. By default, AWS utilities look for these in your environment variables.

You should set the following as Domino environment variables on your user account:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY

Read Environment Variables for Secure Credential Storage to learn more about Domino environment variables.
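As a quick sanity check, you can inspect the environment from a Python workspace to see which of these mechanisms is available. The sketch below is illustrative only and is not part of either setup method; it simply looks for the standard AWS variable names described above.

import os

# check for a propagated credentials file (method 1)
creds_file = os.environ.get('AWS_SHARED_CREDENTIALS_FILE')
if creds_file and os.path.exists(creds_file):
    print(f'Using shared credentials file at {creds_file}')
# check for access keys stored as environment variables (method 2)
elif os.environ.get('AWS_ACCESS_KEY_ID') and os.environ.get('AWS_SECRET_ACCESS_KEY'):
    print('Using AWS access keys from environment variables')
else:
    print('No AWS credentials found in the environment')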


Getting a file from an S3-hosted public path

If you have files in S3 that are set to allow public read access, you can fetch those files with Wget from the OS shell of a Domino executor, the same way you would for any other resource on the public Internet. The request for those files will look similar to this:

wget https://s3-<region>.amazonaws.com/<bucket-name>/<filename>
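If you are working in a notebook rather than a shell, the same unauthenticated request can be made from Python with the standard library. This is a minimal sketch of the equivalent download, using the same placeholder region, bucket, and filename as the wget command above.

import urllib.request

# download a publicly readable object over HTTPS (no credentials involved)
url = 'https://s3-<region>.amazonaws.com/<bucket-name>/<filename>'
urllib.request.urlretrieve(url, '<filename>')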

This method is very simple, but doesn’t allow for any authentication or authorization, and should not be used with sensitive data.




AWS CLI

A more secure method of reading from S3 in the OS shell of a Domino executor is the AWS CLI. Getting the AWS CLI working on your executor is a two-step process: install it in your environment, then provide it with your credentials.


Environment setup

The AWS CLI is available as a Python package from pip. The Dockerfile instruction below installs the CLI and automatically adds it to your system PATH. This instruction assumes you already have pip installed.

RUN pip install awscli --upgrade

For a basic introduction to modifying Domino environments, watch this tutorial video.


Usage

Once your Domino environment and credentials are set up correctly, you can fetch the contents of an S3 bucket to your current directory by running:

aws s3 sync s3://<bucket-name> .

If you are using an AWS credentials file with multiple profiles, you may need to specify the profile (the “default” profile is used if none is specified):

aws s3 sync s3://<bucket-name> . --profile <profile name>

Read the official AWS CLI documentation on S3 for more commands and options.




Python and boto3

The best available library for interacting with AWS services from Python is boto3, the officially supported AWS SDK for Python.


Environment setup

If you’re using one of the Domino standard environments, boto3 will already be installed. If you want to add boto3 to an environment, use the following Dockerfile instruction.

This instruction assumes you already have pip installed.

RUN pip install boto3

For a basic introduction to modifying Domino environments, watch this tutorial video.


Usage

There are many methods for interacting with S3 from boto3 detailed in the official documentation. Below is a simple example for downloading a file where:

  • you have set up your credentials as instructed above
  • your account has access to an S3 bucket named my_bucket
  • the bucket contains an object named some_data.csv
import boto3
import pandas as pd

# create a new S3 client
client = boto3.client('s3')

# download some_data.csv from my_bucket and write to ./some_data.csv locally
client.download_file('my_bucket', 'some_data.csv', './some_data.csv')

Alternatively, if you are using a credential file:

import boto3

# specify your profile if your credentials file contains multiple profiles
session = boto3.Session(profile_name='<profile name>')

# specify your bucket name
users_bucket = session.resource('s3').Bucket('my_bucket')

# listing the bucket contents should succeed
for obj in users_bucket.objects.all():
    print(obj.key)

# download a file
users_bucket.download_file('some_data.csv', './some_data.csv')

Note that this code does not pass credentials as arguments to the client or session constructors, since it assumes either:

  • credentials will be automatically populated at /var/lib/domino/home/.aws/credentials as specified in the environment variable AWS_SHARED_CREDENTIALS_FILE
  • you have already set up credentials in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

After running the above code, you would expect a local copy of some_data.csv to now exist in the same directory as your Python script or notebook. You could follow this up by loading the data into a pandas dataframe.

df = pd.read_csv('some_data.csv')

Check out part 1 of the First steps in Domino tutorial for a more detailed example of working with CSV data in Python.




R and aws.s3

The cloudyr project offers a package called aws.s3 for interacting with S3 from R.


Environment setup

If you’re using one of the Domino standard environments, aws.s3 will already be installed. If you want to add aws.s3 to an environment, use the following Dockerfile instructions.

RUN R -e 'install.packages(c("httr","xml2"), repos="https://cran.r-project.org")'
RUN R -e 'install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))'

For a basic introduction to modifying Domino environments, watch this tutorial video.


Usage

You can find basic instructions on using aws.s3 from the package README. Below is a simple example for downloading a file where:

  • you have set up the correct environment variables with credentials for your AWS account
  • your account has access to an S3 bucket named my_bucket
  • the bucket contains an object named some_data.csv
# load the package
library("aws.s3")

# set your profile if you are using a credentials file with multiple profiles; otherwise, this can be omitted
Sys.setenv("AWS_PROFILE" = "<AWS profile>")

# download some_data.csv from my_bucket and write to ./some_data.csv locally
save_object("some_data.csv", file = "./some_data.csv", bucket = "my_bucket")

After running the above code, you would expect a local copy of some_data.csv to now exist in the same directory as your R script or notebook. You can then read from that local file to work with the data it contains.

myData <- read.csv(file="./some_data.csv", header=TRUE, sep=",")
View(myData)