Data science projects often require multiple steps to go from raw data to useful data products. These steps tend to be sequential, and involve things like:
- Sourcing data
- Cleaning data
- Processing data
- Training models
After you understand the steps necessary to deliver results from your work, it’s useful to automate them as a repeatable pipeline. Domino can schedule Jobs, but for more complex pipelines you can pair Domino with an external scheduling system like Kubeflow.
This topic describes how to integrate Kubeflow with Domino by using the python-domino package.
Kubeflow is an open-source platform designed for machine learning workflows on Kubernetes. It facilitates the orchestration of ML workflows, allowing you to define and manage machine learning pipelines.
If you’re new to Kubeflow, see the Kubeflow documentation for guidance on setting up and configuring your Kubeflow environment.
Configuration typically involves provisioning a Kubernetes cluster, establishing the necessary networking, and defining the resources available for pipeline execution. The Kubeflow documentation also covers managing machine learning workflows in more depth.
To create Kubeflow pipelines that interact with Domino, you’ll need to install the python-domino package on your Kubeflow cluster.
To install the required package, follow these steps:
- Access your Kubeflow cluster.
- Install the python-domino package (published on PyPI as dominodatalab) using the following command:

  pip install dominodatalab
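Alternatively, if your pipeline steps run in ephemeral component containers rather than on a shared cluster image, the Kubeflow Pipelines (kfp) v2 SDK can install the package per component. The following is a minimal sketch; the component name and its body are placeholders used only to confirm that the client imports:

from kfp import dsl

# Install python-domino (published on PyPI as "dominodatalab") into this
# component's container at run time instead of baking it into a cluster image.
@dsl.component(base_image="python:3.10", packages_to_install=["dominodatalab"])
def check_domino_client() -> str:
    # Hypothetical smoke test: confirm the client library imports correctly.
    from domino import Domino
    return "python-domino is available"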
Configure the connection between Kubeflow and Domino by setting up environment variables that point to the Domino host and store your Domino API key.
- Set an environment variable to point to the Domino host:

  export DOMINO_API_HOST=<your-domino-url>

- Store the user API key you want to use to authenticate to Domino:

  export DOMINO_API_KEY=<your-api-key>
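With these variables set, a python-domino client can be constructed from them at runtime. This is a minimal sketch; the Project identifier is a placeholder:

import os
from domino import Domino

# Build a python-domino client from the environment variables set above.
# "your-username/your-project" is a placeholder Domino Project identifier.
domino_client = Domino(
    "your-username/your-project",
    host=os.environ["DOMINO_API_HOST"],
    api_key=os.environ["DOMINO_API_KEY"],
)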
Domino’s code-first approach aligns well with defining tasks within Kubeflow pipelines to initiate Jobs in Domino.
Architecturally, Kubeflow pipelines are defined, managed, and operated within a Kubernetes cluster that is separate from your Domino Environment, so Kubeflow requires network connectivity to the Domino API to execute Jobs in your Domino Projects. The code for each pipeline step, such as data retrieval, cleaning, and model training, is stored and versioned in your Domino Project. This integration combines Kubeflow’s orchestration capabilities with Domino’s reproducibility features.
Below is an example of configuring a Kubeflow pipeline that interacts with Domino to perform sequential steps such as data fetching, processing, model training, and report generation:
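The sketch below assumes the Kubeflow Pipelines (kfp) v2 SDK and python-domino’s job_start_blocking method. The Project identifier, script names, and the way the host and API key reach the components (here, plain pipeline parameters) are placeholders; in practice you would typically inject the API key from a Kubernetes Secret rather than pass it as a parameter.

from kfp import dsl


@dsl.component(base_image="python:3.10", packages_to_install=["dominodatalab"])
def run_domino_job(host: str, api_key: str, project: str, command: str):
    """Start a Domino Job for the given command and wait for it to finish."""
    from domino import Domino

    client = Domino(project, host=host, api_key=api_key)
    # job_start_blocking submits the Job and polls until it completes, so each
    # pipeline step finishes in Domino before the next one starts.
    client.job_start_blocking(command=command)


@dsl.pipeline(name="domino-ml-pipeline")
def domino_pipeline(host: str, api_key: str, project: str):
    # Each step runs a script that is stored and versioned in the Domino Project.
    fetch_task = run_domino_job(host=host, api_key=api_key, project=project,
                                command="fetch_data.py")
    clean_task = run_domino_job(host=host, api_key=api_key, project=project,
                                command="clean_data.py")
    clean_task.after(fetch_task)
    train_task = run_domino_job(host=host, api_key=api_key, project=project,
                                command="train_model.py")
    train_task.after(clean_task)
    report_task = run_domino_job(host=host, api_key=api_key, project=project,
                                 command="generate_report.py")
    report_task.after(train_task)

To run it, compile the pipeline to YAML with kfp.compiler.Compiler().compile(domino_pipeline, "domino_pipeline.yaml") and upload the file through the Kubeflow Pipelines UI, or submit it with the kfp client.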
Learn how to schedule Jobs and view Job results.