Working with your cluster

Creating a cluster with workspaces

To create an on-demand Ray cluster attached to a Domino workspace, click New Workspace from the Workspaces menu. On the Launch New Workspace dialog, select the Compute Cluster step. Specify the desired cluster settings and launch your workspace. Once the workspace is up, it will have access to the Ray cluster you configured.

new_workspace_with_ray.png

Creating a cluster with jobs

As with workspaces, to create an on-demand Ray cluster attached to a Domino job, click Run from the Jobs menu. On the Start a Job dialog, select the Compute Cluster step. Specify the desired cluster settings and launch your job. The job will have access to the Ray cluster you configured.

new_job_with_ray.png

As your command, you can use any Python script that interacts with your Ray cluster.
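
For example, a minimal job script along these lines would distribute a trivial computation across the cluster. The file name and task are illustrative only; the connection pattern is the one described under Connecting to your cluster below.

# ray_job.py -- illustrative job command script
import os

import ray
import ray.util

# Connect to the attached cluster (see "Connecting to your cluster").
if not ray.is_initialized():
    service_host = os.environ["RAY_HEAD_SERVICE_HOST"]
    service_port = os.environ["RAY_HEAD_SERVICE_PORT"]
    ray.util.connect(f"{service_host}:{service_port}")

@ray.remote
def square(x):
    return x * x

# The tasks are scheduled across the Ray workers.
print(ray.get([square.remote(i) for i in range(10)]))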

Understanding your cluster settings

Domino makes it simple to specify key settings when creating a Ray cluster.

cluster_settings_ray.png

  • Number of Workers

    Number of Ray node workers that will make up the Ray cluster. The combined capacity of the workers will be available for your workloads.

  • Quota Max

    The maximum number of workers that you can make available to your cluster is limited by the number of per-user executions that your Domino administrator has configured for your deployment or by the maximum simultaneous executions of the underlying Hardware Tier used for workers.

    In addition to the number of Ray node workers, you will need 1 slot for the cluster head node and 1 slot for your workspace or job. For example, with a per-user limit of 25 executions, you can request at most 23 workers.

  • Worker Hardware Tier

    The amount of compute resources (CPU, GPU, and memory) that will be made available to each Ray node worker.

  • Head Hardware Tier

    Same mechanics as the Worker Hardware Tier, but applied to the resources that will be available for your Ray cluster head node.

    The Ray head node coordinates the Ray workers, so it does not need a significant amount of CPU resources. It hosts the Ray Global Control Store (GCS), and the amount of memory it requires depends on the complexity of your application.

  • Cluster Compute Environment

    Designates your compute environment for the Ray cluster.

  • Dedicated local storage per executor

    The amount of dedicated storage in gibibytes (2^30 bytes) that will be available to each Ray worker.

    The storage will be automatically mounted to /tmp (see the example after this list).

    The storage will be automatically provisioned when the cluster is created and de-provisioned when it is shut down.

    Warning

    The local storage per worker should not be used to store any data that needs to be available after the cluster is shut down.
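
As a sketch of how this storage might be used, a worker task can write intermediate scratch files to /tmp; the task and file names below are illustrative only.

import os

import ray

@ray.remote
def process_chunk(chunk_id):
    # Files written under /tmp land on the worker's dedicated local
    # storage and are removed when the cluster is de-provisioned.
    scratch_path = os.path.join("/tmp", f"chunk_{chunk_id}.part")
    with open(scratch_path, "w") as f:
        f.write("intermediate results")
    return scratch_path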

Connecting to your cluster

When provisioning your on-demand Ray cluster, Domino sets up environment variables that hold the information needed to easily connect to your cluster.

The following snippet can be used to connect.

Note

Do not use ray.init() to connect to the cluster; use the Ray Client (ray.util.connect) instead.

import os

import ray
import ray.util

...

# Connect through the Ray Client using the environment variables that
# Domino sets when it provisions the cluster.
if not ray.is_initialized():
    service_host = os.environ["RAY_HEAD_SERVICE_HOST"]
    service_port = os.environ["RAY_HEAD_SERVICE_PORT"]
    ray.util.connect(f"{service_host}:{service_port}")
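
Once connected, a quick sanity check is to inspect the resources the cluster reports and run a trivial remote task. This is illustrative only; the task name is not part of any Domino or Ray API.

# Show the combined CPU, GPU, and memory capacity of the attached workers.
print(ray.cluster_resources())

# Run a trivial remote task as a connectivity check.
@ray.remote
def ping():
    return "pong"

print(ray.get(ping.remote()))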

Accessing the Ray Web UI

Ray provides a built-in dashboard with access to metrics, charts, and other features that help you understand your Ray cluster, libraries, and workloads.

The dashboard allows for the following:

  • View cluster metrics.

  • View logs, errors, and exceptions across many machines in a single pane.

  • View resource utilization, tasks, and logs per node and per actor.

  • Kill actors and profile Ray jobs.

  • See Tune jobs and trial information.

Domino makes the Ray Web UI available for active on-demand clusters attached to both workspaces and jobs.

Ray UI from Workspaces

The Ray UI is available from a dedicated tab in your workspace.

access_ui_tabs_ray.png

Ray UI from Jobs

The Ray UI is also available for running jobs from the Details tab.

access_ui_job_ray.png

Cluster lifecycle

On workspace or job startup, a Domino on-demand Ray cluster with the desired cluster settings is automatically provisioned and attached to the workspace or job as soon as the cluster becomes available.

On workspace or job termination, the on-demand Ray cluster and all associated resources are automatically terminated and de-provisioned. This includes any compute resources and storage allocated for the cluster.

Cluster network security

The on-demand Ray clusters created by Domino are not meant to be shared between multiple users. Each cluster is associated with a single workspace or job instance, and access to the cluster and the Ray Web UI is restricted to users who can access that workspace or job. This restriction is enforced at the networking level, and the cluster is only reachable from the execution that provisioned it.