When using a Domino on-demand Dask cluster, any data that is created or modified as part of your workload needs to go into an external data store.
Warning

On-demand clusters in Domino are ephemeral. Any data that is stored on cluster-local storage and not externally will be lost when the workload and the cluster terminate.
When you create a Dask cluster attached to a Domino workspace or job, any Domino dataset accessible from the workspace or job will also be accessible from all components of the cluster under the same dataset mount path. You will then be able to access the files from your code using the same path regardless of whether your code runs on your workspace or job container or in a Dask task on the cluster.
For example, to read a file you would use the following.
import dask.dataframe as dd

# The dataset mount path is the same on the workspace/job container and on the Dask workers
df = dd.read_parquet("/mnt/data/my_dataset/large_dataset.parquet")
To access data in Amazon S3 (or an S3-compatible object store) with Dask, you can use any of the libraries you already use (for example, boto3 or s3fs) to pull files down from S3.
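For example, here is a minimal sketch of both approaches; the bucket and key names are placeholders, and credentials are assumed to already be available in the environment.

import boto3
import s3fs

# Placeholder bucket and object key
bucket, key = "my-bucket", "raw/large_file.csv"

# Option 1: boto3 -- download a single object to local scratch space
boto3.client("s3").download_file(bucket, key, "/tmp/large_file.csv")

# Option 2: s3fs -- the same copy through a filesystem-style interface
fs = s3fs.S3FileSystem()
fs.get(f"s3://{bucket}/{key}", "/tmp/large_file.csv")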
For structured data, you can also read it directly into Dask DataFrames or bags by specifying s3:// as the protocol in the path.
The following is a basic example.
import dask.dataframe as dd

# Read Parquet or CSV data directly from S3 into a Dask DataFrame
df = dd.read_parquet("s3://bucket/path/data-*.parquet")
df = dd.read_csv("s3://bucket/path/data-*.csv")
Note

Dask uses s3fs to access data on S3, so the s3fs package must be available in your environment and on the cluster.
Additional parameters (for example, authentication keys) can be passed through the storage_options argument.
For full documentation of the S3-specific options (including loading data from S3-compatible services), refer to the relevant section of the Dask documentation.
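For instance, here is a minimal sketch of passing explicit credentials and a custom endpoint for an S3-compatible service through storage_options; the key names shown are the ones accepted by s3fs, and all values are placeholders.

import dask.dataframe as dd

# Placeholder credentials and endpoint for an S3-compatible service;
# storage_options keyword arguments are forwarded to s3fs.S3FileSystem.
df = dd.read_csv(
    "s3://bucket/path/data-*.csv",
    storage_options={
        "key": "MY_ACCESS_KEY_ID",
        "secret": "MY_SECRET_ACCESS_KEY",
        "client_kwargs": {"endpoint_url": "https://s3.example.com"},
    },
)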
AWS credential propagation
When AWS credential propagation is enabled for your deployment, temporary AWS credentials corresponding to the roles enabled for you in your company identity provider will automatically be available on all Dask workers, as well as in your execution.
The credentials are refreshed automatically and are available in an AWS credentials file, under a profile name corresponding to each role.
The location of this file is stored in the AWS_SHARED_CREDENTIALS_FILE environment variable, which puts it on the default search path for s3fs and boto3.
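As a quick sanity check (not Domino-specific), the following sketch lists the profiles that boto3 can see in that credentials file.

import os
import boto3

# botocore honors AWS_SHARED_CREDENTIALS_FILE when locating the credentials file
print(os.environ.get("AWS_SHARED_CREDENTIALS_FILE"))

# One profile per propagated role is available for authentication
print(boto3.session.Session().available_profiles)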
Specify the name of the profile that corresponds to the role you want to use for authentication. For example:
import dask.dataframe as dd

# Use the credentials profile that corresponds to the desired role
df = dd.read_parquet(
    "s3://bucket/path/data-*.parquet",
    storage_options={
        "profile_name": "my-role-profile",
    },
)
Similar to S3, Dask can load data from Microsoft Azure Storage, Google Cloud Storage, HDFS, HTTP, NFS, and your local file system.
Detailed documentation describing the protocol to use, the required packages, and the available storage_options can be found in the Remote data section of the Dask documentation.
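For illustration, the same pattern applied to other stores might look like the following sketch, assuming the gcsfs and adlfs packages are installed; the bucket, container, and account names are placeholders.

import dask.dataframe as dd

# Google Cloud Storage via the gcsfs package (placeholder bucket name)
gcs_df = dd.read_parquet("gcs://my-bucket/path/data-*.parquet")

# Azure Blob Storage via the adlfs package; the storage account name is
# passed through storage_options (placeholder container and account names)
abfs_df = dd.read_csv(
    "abfs://my-container/path/data-*.csv",
    storage_options={"account_name": "mystorageaccount"},
)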