Orchestrate Jobs with Apache Airflow

Data science projects often require multiple steps to go from raw data to useful data products. These steps tend to be sequential, and involve things like:

  • Sourcing data

  • Cleaning data

  • Processing data

  • Training models

After you understand the steps necessary to deliver results from your work, it’s useful to automate them as a repeatable pipeline. Domino can schedule Jobs, but for more complex pipelines you can pair Domino with an external scheduling system like Apache Airflow.

This topic describes how to integrate Airflow with Domino by using the python-domino package.

Get started with Airflow

Airflow is an open-source platform to author, schedule, and monitor pipelines of programmatic tasks. You can define pipelines with code and configure the Airflow scheduler to execute the underlying tasks. You can use the Airflow application to visualize, monitor, and troubleshoot pipelines.
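
To give a sense of what defining a pipeline with code looks like, here is a minimal sketch of an Airflow DAG with two dependent tasks. The DAG id, schedule, and commands are placeholders, and the import paths assume Airflow 2.x:

    # minimal_dag.py -- illustrative only; names and schedule are placeholders
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        source_data = BashOperator(
            task_id="source_data",
            bash_command="echo 'sourcing data'",
        )
        clean_data = BashOperator(
            task_id="clean_data",
            bash_command="echo 'cleaning data'",
        )

        # clean_data runs only after source_data succeeds
        source_data >> clean_data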

If you are new to Airflow, read the Airflow QuickStart to set up your own Airflow server.

There are many options for configuring your Airflow server. For pipelines that run tasks in parallel, you must use Airflow’s LocalExecutor mode, which executes multiple tasks and their dependencies at the same time. Airflow keeps a record of every task it schedules and executes in a database, so you must install and configure a SQL database to use LocalExecutor mode.
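
As a rough sketch, enabling LocalExecutor mode typically means editing airflow.cfg to change the executor and point Airflow at your SQL database. The connection string below is a placeholder, and the section that holds sql_alchemy_conn can differ between Airflow versions:

    # airflow.cfg (illustrative excerpt)
    [core]
    # Run tasks in parallel on the local machine
    executor = LocalExecutor

    # LocalExecutor requires a real SQL database instead of the default SQLite.
    # Replace this placeholder connection string with your own database.
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow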

Read A Guide On How To Build An Airflow Server/Cluster to learn more about setting up LocalExecutor mode.

For more information about scheduling and triggers, notifications, and pipeline monitoring, read the Airflow documentation.

Install python-domino on your Airflow workers

To create Airflow tasks that work with Domino, you must install python-domino on your Airflow workers. Use this library to add tasks in your pipeline code that interact with the Domino Platform API to start Jobs.

Connect to your Airflow workers, and follow these steps to install and configure python-domino:

  1. Install python-domino from pip:

    pip install dominodatalab

  2. Set up an Airflow variable to point to the Domino host. This is the URL where you load the Domino application in your browser.

    Key: DOMINO_API_HOST
    Value: <your-domino-url>

  3. Set up an Airflow variable to store the user API key you want to use with Airflow. This is the API key for the Domino user that Airflow authenticates as when it starts Jobs. A CLI sketch that creates both variables follows these steps.

    Key: DOMINO_API_KEY
    Value: <your-api-key>
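
You can create both variables in the Airflow UI (Admin > Variables) or from the command line on a machine where the Airflow CLI is configured. With Airflow 2.x the commands look roughly like this, with placeholder values:

    # Replace the placeholder values with your Domino URL and API key
    airflow variables set DOMINO_API_HOST <your-domino-url>
    airflow variables set DOMINO_API_KEY <your-api-key>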

How Airflow tasks map to Domino Jobs

Airflow pipelines are defined with Python code. This fits in well with Domino’s code-first philosophy. You can use python-domino in your pipeline definitions to create tasks that start Jobs in Domino.

Architecturally, Airflow has its own server and worker nodes, and Airflow will operate as an independent service that sits outside of your Domino deployment. Airflow will need network connectivity to Domino so its workers can access the Domino Platform API to start Jobs in your Domino Project. All the code that performs the actual work in each step of the pipeline — code that fetches data, cleans data, and trains data science models — is maintained and versioned in your Domino Project. This way you have Domino’s Reproducibility Engine working together with Airflow’s scheduler.
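
To make this concrete, the sketch below shows a pipeline definition that uses python-domino inside standard Airflow PythonOperator tasks. It assumes the DOMINO_API_HOST and DOMINO_API_KEY variables from the previous section, uses a placeholder Project name and script names, and calls python-domino’s blocking run method so each Airflow task waits for its Domino Job to finish; check the python-domino documentation for the exact methods available in your version:

    # domino_pipeline.py -- illustrative sketch, not a drop-in DAG.
    # Assumes the DOMINO_API_HOST and DOMINO_API_KEY Airflow variables exist;
    # the Project name and script names below are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.python import PythonOperator
    from domino import Domino


    def run_domino_job(command):
        # python-domino calls the Domino Platform API to start a Job;
        # runs_start_blocking polls until the Job finishes.
        domino = Domino(
            "your-username/your-project",
            api_key=Variable.get("DOMINO_API_KEY"),
            host=Variable.get("DOMINO_API_HOST"),
        )
        domino.runs_start_blocking(command=command, title="Started from Airflow")


    with DAG(
        dag_id="domino_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        clean_data = PythonOperator(
            task_id="clean_data",
            python_callable=run_domino_job,
            op_kwargs={"command": ["clean_data.py"]},
        )
        train_model = PythonOperator(
            task_id="train_model",
            python_callable=run_domino_job,
            op_kwargs={"command": ["train_model.py"]},
        )

        # train_model starts only after clean_data's Domino Job succeeds
        clean_data >> train_model

Each task here maps to one Domino Job, so the code, environment, and results for every step are captured in your Project just like any Job you start from Domino directly.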

Airflow pipeline

Next steps

Learn how to schedule Jobs and view Job results.