Domino Flows enables efficient orchestration and monitoring of complex, interconnected multi-step processes while ensuring full lineage and reliable reproducibility. Each step, implemented as a Domino Job, is a task; the complete structure of connections between tasks is a workflow. Tasks produce outputs that become inputs to other tasks, forming the basis for the connections. A Flow definition therefore constructs a DAG (directed acyclic graph).
Note
To support reproducibility, each task must be side-effect free: it reads versioned inputs and writes defined outputs.
Flows is flexible enough to declaratively model arbitrarily complex processes. Dependency relationships between tasks determine the order in which they run and whether they can be parallelized. Scenarios spanning machine learning, data engineering, and data analytics benefit from this level of control and reproducibility.
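The way a DAG determines execution order and parallelism can be sketched with Python's standard-library `graphlib`. This is a conceptual illustration only, not the Flows API, and the task names are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks whose
# outputs it consumes (its dependencies).
dag = {
    "load_images": set(),
    "load_labels": set(),
    "preprocess": {"load_images", "load_labels"},
    "train": {"preprocess"},
}

ts = TopologicalSorter(dag)
ts.prepare()

# Tasks returned together by get_ready() have no unmet dependencies,
# so a scheduler could run each batch in parallel.
order = []
while ts.is_active():
    ready = sorted(ts.get_ready())
    order.append(ready)
    for task in ready:
        ts.done(task)

print(order)  # → [['load_images', 'load_labels'], ['preprocess'], ['train']]
```

Here `load_images` and `load_labels` have no dependency relationship, so they appear in the same batch and could run in parallel, while `preprocess` must wait for both.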
For instance, Flows would be an ideal choice for scenarios like:
- Executing a data processing workflow in Dask prior to a training workflow in XGBoost
- Running a clinical study pipeline by loading SDTM datasets to produce ADaM datasets and TFL reports
- Collecting image metadata from S3 with Spark and performing model inference with PyTorch
- Loading financial data from Snowflake and processing it for use in a Ray training job that registers a model in MLflow
- Processing a local protein database to search for a nucleotide sequence and generating a scatterplot
Flows may not be the most appropriate choice for modeling a process that accesses a single dataset and performs many small computations in a homogeneous environment. Tasks that write to mutable shared state (such as read-write datasets) are not compatible with Flows as-is, but can be made compatible by modifying them to write defined outputs instead.
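For example, a step that appends results to shared mutable state can be restructured to return its result as a declared output instead. A minimal, Flows-agnostic sketch of the pattern (all function names are hypothetical):

```python
# Not Flows-compatible: mutates shared state that other tasks may also touch.
shared_results = []

def score_batch_mutating(batch):
    shared_results.append(sum(batch))  # hidden side effect

# Flows-compatible pattern: read declared inputs, return declared outputs.
def score_batch(batch: list[int]) -> int:
    return sum(batch)  # the result becomes a tracked task output

# A downstream task consumes the outputs explicitly instead of reading
# mutable shared state.
def total(scores: list[int]) -> int:
    return sum(scores)

print(total([score_batch([1, 2]), score_batch([3, 4])]))  # → 10
```

Because `score_batch` depends only on its inputs and produces only its return value, rerunning it with the same inputs always yields the same output, which is what makes lineage and reproducibility tractable.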
Flows extends the Domino Job system with key new functionality, including:
- Programmatic, Python-based authoring of versioned, reusable, repeatable, immutable workflows
- Strongly typed definitions of the inputs consumed and outputs produced by each task
- Automatic lineage and versioning of all task and workflow inputs and outputs
- Heterogeneous, isolated environment support for any task
- Stronger reproducibility requirements and guarantees
- Visualization of the workflow execution graph and the ability to inspect and monitor each task, its inputs, and its outputs
- Parallel execution of tasks at scale
- Configurable caching and task result reuse anywhere within the workflow
- Flow Artifacts for discovery, inspection, and reuse of specially annotated outputs within a project
- Automatic recovery from intermittent failures and manual recovery of partial executions
Read more about the differences between Flows-generated tasks that run Domino Jobs and standalone Domino Jobs.
Flows is built on the open-source framework Flyte.
Some key terms to understand before getting started with Flows include:
Term | Definition
---|---
Task | Tasks are the core building blocks within a flow and are isolated within their own containers during execution. A task maps to a single Domino Job.
Flow | A flow is a composition of multiple tasks or other flows (called subflows). Flows can be triggered through a single command and are tracked as a single, fully reproducible entity.
Node | A node represents a unit of execution or work within a flow (nodes show up as individual blocks in the graph views). A node can contain either a single task or a whole flow (called a subflow).
Task inputs | Task inputs are strongly typed parameters that can be defined on individual tasks. Inputs allow tasks to be rerun with different settings through the UI, without modifying the code itself. Inputs can be read and used within executions.
Task outputs | Task outputs are strongly typed parameters that define the results produced by a task. Outputs are tracked and stored in discrete blob storage so that they can be used as inputs to other tasks.
Flow inputs/outputs | Flow inputs/outputs are similar to task inputs/outputs but are defined at the flow level. Inputs defined for a flow can be passed into relevant tasks, and outputs from tasks can be returned as the overall output of the flow.
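The terms above fit together as follows. Real Flows code is authored with Flyte-style decorators; the sketch below uses plain Python with type hints only to illustrate the pattern: strongly typed task inputs and outputs, composed into a flow whose own input is passed into a task and whose output is a task's result (all names are hypothetical, not the Flows API):

```python
# Each "task" declares typed inputs and a typed output.
def normalize(values: list[float], scale: float) -> list[float]:
    return [v / scale for v in values]

def mean(values: list[float]) -> float:
    return sum(values) / len(values)

# The "flow" wires one task's output to the next task's input; the flow
# input `scale` is passed into a task, and the final task's output is
# returned as the flow's overall output.
def stats_flow(raw: list[float], scale: float) -> float:
    normalized = normalize(raw, scale)  # task 1
    return mean(normalized)             # task 2; its result is the flow output

print(stats_flow([2.0, 4.0, 6.0], scale=2.0))  # → 2.0
```

In an actual flow, `normalized` would be tracked as a versioned task output in blob storage, and the dependency between the two tasks is what places them in order in the execution graph.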