Data science projects often require multiple steps to go from raw data to useful data products. These steps tend to be sequential, and involve things like:
- Sourcing data
- Cleaning data
- Processing data
- Training models
After you understand the steps necessary to deliver results from your work, it’s useful to automate them as a repeatable pipeline. Domino can schedule Jobs, but for more complex pipelines you can pair Domino with an external scheduling system like Kubeflow.
This topic describes how to integrate Kubeflow with Domino by using the python-domino package.
Kubeflow is an open-source platform designed for machine learning workflows on Kubernetes. It facilitates the orchestration of ML workflows, allowing you to define and manage machine learning pipelines.
If you’re new to Kubeflow, see the Kubeflow documentation for guidance on setting up and configuring your Kubeflow environment.
Configuration typically involves provisioning a Kubernetes cluster, establishing the necessary networking, and defining the resources available for pipeline execution. The Kubeflow documentation also covers managing machine learning workflows in more depth.
To create Kubeflow pipelines that interact with Domino, you’ll need to install the python-domino package on your Kubeflow cluster.
To install the required package, follow these steps:
- Access your Kubeflow cluster.
- Install the python-domino package (published on PyPI as dominodatalab) using the following command:

  pip install dominodatalab
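Alternatively, if your pipeline steps run in ephemeral component containers rather than on a shared cluster image, the Kubeflow Pipelines (kfp) v2 SDK can install the package per component. The following is a minimal sketch; the component name and its body are placeholders used only to confirm that the client imports:

from kfp import dsl

# Install python-domino (published on PyPI as "dominodatalab") into this
# component's container at run time instead of baking it into a cluster image.
@dsl.component(base_image="python:3.10", packages_to_install=["dominodatalab"])
def check_domino_client() -> str:
    # Hypothetical smoke test: confirm the client library imports correctly.
    from domino import Domino
    return "python-domino is available"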
Configure the connection between Kubeflow and Domino by setting up environment variables that point to the Domino host and store your Domino API key.
- Set an environment variable to point to the Domino host:

  export DOMINO_API_HOST=<your-domino-url>

- Store the user API key you want to use to authenticate to Domino:

  export DOMINO_API_KEY=<your-api-key>
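With these variables set, a python-domino client can be constructed from them at runtime. This is a minimal sketch; the Project identifier is a placeholder:

import os
from domino import Domino

# Build a python-domino client from the environment variables set above.
# "your-username/your-project" is a placeholder Domino Project identifier.
domino_client = Domino(
    "your-username/your-project",
    host=os.environ["DOMINO_API_HOST"],
    api_key=os.environ["DOMINO_API_KEY"],
)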
Domino’s code-first approach aligns well with defining tasks within Kubeflow pipelines to initiate Jobs in Domino.
Architecturally, Kubeflow pipelines are defined, managed, and operated within a Kubernetes cluster that is separate from your Domino Environment, so Kubeflow requires network connectivity to the Domino API to execute Jobs in your Domino Projects. The code for each pipeline step, such as data retrieval, cleaning, and model training, is stored and versioned in your Domino Project. This integration combines Kubeflow’s orchestration capabilities with Domino’s reproducibility features.
Below is an example of configuring a Kubeflow pipeline that interacts with Domino to perform sequential steps such as data fetching, processing, model training, and report generation:
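The sketch below assumes the Kubeflow Pipelines (kfp) v2 SDK and python-domino’s job_start_blocking method. The Project identifier, script names, and the way the host and API key reach the components (here, plain pipeline parameters) are placeholders; in practice you would typically inject the API key from a Kubernetes Secret rather than pass it as a parameter.

from kfp import dsl


@dsl.component(base_image="python:3.10", packages_to_install=["dominodatalab"])
def run_domino_job(host: str, api_key: str, project: str, command: str):
    """Start a Domino Job for the given command and wait for it to finish."""
    from domino import Domino

    client = Domino(project, host=host, api_key=api_key)
    # job_start_blocking submits the Job and polls until it completes, so each
    # pipeline step finishes in Domino before the next one starts.
    client.job_start_blocking(command=command)


@dsl.pipeline(name="domino-ml-pipeline")
def domino_pipeline(host: str, api_key: str, project: str):
    # Each step runs a script that is stored and versioned in the Domino Project.
    fetch_task = run_domino_job(host=host, api_key=api_key, project=project,
                                command="fetch_data.py")
    clean_task = run_domino_job(host=host, api_key=api_key, project=project,
                                command="clean_data.py")
    clean_task.after(fetch_task)
    train_task = run_domino_job(host=host, api_key=api_key, project=project,
                                command="train_model.py")
    train_task.after(clean_task)
    report_task = run_domino_job(host=host, api_key=api_key, project=project,
                                 command="generate_report.py")
    report_task.after(train_task)

To run it, compile the pipeline to YAML with kfp.compiler.Compiler().compile(domino_pipeline, "domino_pipeline.yaml") and upload the file through the Kubeflow Pipelines UI, or submit it with the kfp client.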
Learn how to schedule Jobs and view Job results.