On-Demand Ray Overview

Ray Overview

Ray (ray.io) is a distributed execution framework that lets you scale single-machine applications with little or no code changes and use state-of-the-art machine learning libraries.

Ray provides a set of core low-level primitives (Ray Core) and a family of pre-packaged libraries that build on these primitives to solve large-scale machine learning problems.
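To illustrate the "little or no changes" claim, here is a minimal sketch of turning an ordinary Python function into a Ray Core task. The function `square` and the helper `run_distributed` are hypothetical names chosen for this example. Ray is imported lazily inside the helper, so the plain function remains usable on a machine where Ray is not installed.

```python
def square(x):
    """Ordinary single-machine function; nothing in it is Ray-specific."""
    return x * x


def run_distributed(values):
    """Run square() on each value as a Ray task and collect the results.

    Hypothetical helper for illustration only.
    """
    import ray  # lazy import: only needed when tasks are actually scheduled

    # Starts a local Ray instance if this process is not already connected.
    ray.init(ignore_reinit_error=True)

    # Wrapping the function with ray.remote turns it into a remote task.
    remote_square = ray.remote(square)

    # .remote() returns futures immediately; ray.get() waits for the results.
    futures = [remote_square.remote(v) for v in values]
    return ray.get(futures)
```

Calling `run_distributed(range(4))` should produce the same values as `[square(v) for v in range(4)]`, with the work scheduled by Ray instead of running in a local loop.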

The following libraries come packaged with Ray:

Additionally, many open-source ML frameworks have adopted Ray as a foundational framework and now offer community-maintained Ray integrations.

Orchestrating Ray on Domino

Domino offers the ability to dynamically provision and orchestrate a Ray cluster directly on the infrastructure backing the Domino instance. This allows Domino users to get quick access to Ray without having to rely on their IT team to create and manage a cluster for them.

When you start a Domino workspace for interactive work or a Domino job for batch processing, Domino creates and manages a containerized Ray cluster for you and makes it available to your execution.
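As a rough sketch of how code in a workspace or job might attach to the provisioned cluster, the example below builds a Ray Client address from environment variables and connects to it. The variable names `RAY_HEAD_SERVICE_HOST` and `RAY_HEAD_SERVICE_PORT` are assumptions, not taken from this page, and the helper names are hypothetical. Confirm the actual connection details for your deployment and Ray version before relying on this.

```python
import os


def ray_client_address(env=os.environ):
    """Build a Ray Client address such as ray://host:port.

    The two variable names below are assumptions made for this sketch.
    """
    host = env["RAY_HEAD_SERVICE_HOST"]
    port = env["RAY_HEAD_SERVICE_PORT"]
    return f"ray://{host}:{port}"


def connect_to_cluster():
    """Connect to the cluster once and report the resources it exposes."""
    import ray  # lazy import: the address helper works without Ray installed

    # Avoid reconnecting if this process already has a Ray session.
    if not ray.is_initialized():
        ray.init(address=ray_client_address())
    return ray.cluster_resources()
```

Keeping the address logic in its own function makes it easy to check without a running cluster, for example by passing a plain dictionary in place of `os.environ`.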

Suitable use cases

Domino on-demand Ray clusters are suitable for the following workloads:

  • Distributed multi-node training: RaySGD provides a lightweight mechanism for taking existing PyTorch and TensorFlow models and scaling them across multiple machines to dramatically reduce training times. Ray is suitable for both distributed CPU and GPU training.

  • Hyperparameter optimization: Launch a distributed hyperparameter sweep with a few lines of code and without adapting your existing training harness. You can choose from a large number of advanced parameter search algorithms.

  • Reinforcement learning: Ray, in combination with the RLlib library, provides a number of built-in reinforcement learning algorithms as well as a general framework for developing your own.
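For the hyperparameter optimization case, a sweep can be sketched as follows. This is an illustration under stated assumptions, not a recommended setup: `objective` is a toy stand-in for a real training function, `run_sweep` is a hypothetical helper, and the example uses the `tune.run` entry point of Ray Tune, which newer Ray releases treat as a legacy interface. Check the Tune documentation for the Ray version in your environment.

```python
def objective(config):
    """Toy training function: the loss is smallest when lr equals 0.1."""
    return {"loss": (config["lr"] - 0.1) ** 2}


def run_sweep(learning_rates):
    """Evaluate objective() once per learning rate and return the best config.

    Hypothetical helper for illustration only.
    """
    from ray import tune  # lazy import: only needed when the sweep runs

    analysis = tune.run(
        objective,
        # grid_search creates one trial per listed value.
        config={"lr": tune.grid_search(list(learning_rates))},
    )
    # Select the trial configuration with the lowest reported loss.
    return analysis.get_best_config(metric="loss", mode="min")
```

If the process is already connected to a Ray cluster, the trials are scheduled on that cluster; otherwise Ray starts a local instance.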


While Ray offers a scalable serving capability, the ephemeral nature of Domino Ray clusters makes them a poor fit for this use case.