Access data with Ray

When you use a Domino on-demand Ray cluster, store any data that the workload reads, creates, or modifies in an external data store.

Warning

On-demand clusters in Domino are ephemeral. Any data stored only on cluster-local storage is lost when the workload ends and the cluster terminates.

Use Domino Datasets

When you create a Ray cluster attached to a Domino workspace or job, any Domino dataset accessible from the workspace or job is also accessible from all components of the cluster under the same dataset mount path. Your code can therefore use the same file path whether it runs in the workspace or job container or in a Ray task on the cluster.

For example, to read a file you would use the following:

with open("/mnt/data/my_dataset/file.csv") as file:
    contents = file.read()  # the file is closed automatically after the block
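Because the mount path is the same everywhere, the same function can run unchanged inside a Ray task. The following is a minimal sketch, not a definitive implementation: it assumes the file is a CSV with a header row, the helper names count_rows and count_rows_on_cluster are hypothetical, and it assumes that ray.init() can find your cluster from the environment. Use whatever connection call your deployment documents.

```python
import csv

# Path from the example above; the file is assumed to be a CSV with a header row.
DATASET_FILE = "/mnt/data/my_dataset/file.csv"


def count_rows(path):
    """Return the number of data rows (excluding the header) in a CSV file."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)


def count_rows_on_cluster(path=DATASET_FILE):
    """Run count_rows as a Ray task; the worker reads the same mount path."""
    import ray  # imported lazily so count_rows stays usable without Ray

    # Assumption: ray.init() picks up the cluster address from the environment.
    ray.init(ignore_reinit_error=True)
    remote_count = ray.remote(count_rows)
    return ray.get(remote_count.remote(path))
```

Calling count_rows(DATASET_FILE) reads the file in the workspace or job container, while count_rows_on_cluster() reads it on a Ray worker. Both use the identical path.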

Use S3

To access data in Amazon S3 (or an S3-compatible object store) with Ray, you can use any of the libraries you already use (for example, boto3, s3fs, or pandas). If Ray workers access the data, the only prerequisite is that the relevant libraries and their dependencies are available in both the base cluster environment and the execution environment.
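As an illustration, the sketch below reads one S3 object with boto3 from inside a Ray task. It is a hedged example under stated assumptions: the helper names are hypothetical, boto3 must be installed in the environment that runs the task, credentials are assumed to be configured already, and ray.init() is assumed to find the cluster from the environment.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_s3_object(uri):
    """Download one object and return its bytes, wherever this function runs."""
    import boto3  # must be installed in the environment that runs this function

    bucket, key = split_s3_uri(uri)
    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


def read_s3_object_on_worker(uri):
    """Run read_s3_object as a Ray task so the download happens on a worker."""
    import ray

    # Assumption: ray.init() picks up the cluster address from the environment.
    ray.init(ignore_reinit_error=True)
    return ray.get(ray.remote(read_s3_object).remote(uri))
```

The import of boto3 sits inside the function that uses it, so a missing library surfaces as an ImportError in the environment that actually lacks it.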

AWS credential propagation

When AWS credential propagation is enabled for your deployment, temporary AWS credentials for the roles assigned to you in your company's identity provider are automatically available on all Ray worker nodes.

The credentials are refreshed automatically and stored in an AWS credential file, under a profile name corresponding to each role. Ray worker code can select a profile by name to access AWS resources with that role's credentials.
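One way to use these profiles from worker code is sketched below. This is an assumption-laden example rather than a Domino-specific API: it relies on the standard AWS SDK convention that the credential file is located by the AWS_SHARED_CREDENTIALS_FILE environment variable (falling back to ~/.aws/credentials), and the helper names are hypothetical. The profile names you see depend on the roles configured for you.

```python
import configparser
import os


def credentials_file_path():
    """Return the shared credential file path that the AWS SDKs would use."""
    # AWS_SHARED_CREDENTIALS_FILE is the standard SDK override. Whether your
    # cluster sets it is an assumption; otherwise the SDK default applies.
    default = os.path.join(os.path.expanduser("~"), ".aws", "credentials")
    return os.environ.get("AWS_SHARED_CREDENTIALS_FILE", default)


def list_profiles(path):
    """Return the profile names defined in an AWS credential file."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file simply yields no profiles
    return parser.sections()


def s3_client_for_profile(profile_name):
    """Create an S3 client that uses the credentials of one named profile."""
    import boto3  # must be installed in the environment that runs this function

    return boto3.Session(profile_name=profile_name).client("s3")
```

Inside a Ray task, list_profiles(credentials_file_path()) shows which roles are available on that worker, and s3_client_for_profile(name) returns a client bound to one of them.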