Every flow, task, and execution is uniquely versioned in Domino Flows to support reproducibility.
Each time a flow is executed through the CLI, its exact definition is serialized and uploaded to Flyte blob storage. The serialized definition includes:
- The structure of the flow and tasks as defined in the function wrapped by the @workflow decorator at the time of execution. This implicitly captures the exact code definition of the flow at the time of execution, without the need for a manual commit.
- The Domino Job config that is defined as part of each task and includes a precise code commit, environment version, hardware tier, dataset snapshots, and all other properties that were defined as part of a job.
- The input and output parameters that can be configured or modified for each execution.
Whenever the serialized definition of a flow or task does not match a previously registered version, a new version of the entity is created and tied to every execution that is instantiated from it. This ensures that re-executions of a flow with the same parameters produce consistent results.
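The rule above can be pictured as content addressing: a definition is fingerprinted, and a new version is registered only when that fingerprint has not been seen before. The following is a minimal sketch of that idea, not the actual Domino or Flyte implementation. The names definition_version and VersionRegistry, and the dictionary fields used as a stand-in task definition, are hypothetical.

```python
import hashlib
import json
from typing import Dict, Tuple


def definition_version(definition: dict) -> str:
    """Return a stable fingerprint for a serialized definition."""
    # Sort keys so that logically identical definitions serialize identically.
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionRegistry:
    """Hypothetical in-memory registry keyed by definition fingerprint."""

    def __init__(self) -> None:
        self._versions: Dict[str, dict] = {}

    def register(self, definition: dict) -> Tuple[str, bool]:
        """Return (version, created); created is True only for unseen definitions."""
        version = definition_version(definition)
        created = version not in self._versions
        if created:
            self._versions[version] = definition
        return version, created


registry = VersionRegistry()
# Stand-in for a serialized task definition (field names are illustrative only).
task = {"name": "Add numbers", "command": "python add.py", "environment_revision": "4"}

v1, created1 = registry.register(task)
v2, created2 = registry.register(dict(task))  # identical definition: same version
v3, created3 = registry.register({**task, "environment_revision": "5"})  # changed: new version
```

In this sketch, re-registering an unchanged definition returns the existing version, while changing any field, such as the environment revision, yields a new one.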
While the automatic versioning of entities in a flow helps to maximize reproducibility, the following additional practices are recommended in order to guarantee it:
- When the use_latest parameter is set in a task definition, Domino will use project defaults or the latest versions of Domino entities for parameters that have not been explicitly defined at the time of execution. While this is useful for quick prototyping, it is recommended to explicitly define all parameters before putting a flow into production. This will ensure that the definition of the serialized flow version doesn’t change if project defaults change or an environment gets updated.
Setting use_latest, as in this example, is useful for development and iteration:
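from flytekitplugins.domino.task import DominoJobConfig, DominoJobTask

DominoJobTask(
    name='Add numbers',
    domino_job_config=DominoJobConfig(Command="python add.py"),
    inputs={'first_value': int, 'second_value': int},
    outputs={'sum': int},
    # Fill any unspecified job settings from project defaults or latest versions.
    use_latest=True
)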
A better definition for production use omits use_latest in favor of explicitly defining all parameters:
from flytekitplugins.domino.task import (
    DatasetSnapshot,
    DominoJobConfig,
    DominoJobTask,
    EnvironmentRevisionSpecification,
    EnvironmentRevisionType,
)

DominoJobTask(
    name='Add numbers',
    domino_job_config=DominoJobConfig(
        Command="python add.py",
        # Pin the exact code commit.
        CommitId="4e6f2ee71e3bd64eaa90ce826c7d523b29f179cd",
        MainRepoGitRef=None,
        HardwareTierId="large-k8s",
        # Pin the environment and its revision.
        EnvironmentId="66b5324c495c6c124cfb0a28",
        EnvironmentRevisionSpec=EnvironmentRevisionSpecification(EnvironmentRevisionType.SomeRevision, "4"),
        ComputeClusterProperties=None,
        VolumeSizeGiB=10,
        # Pin the dataset snapshot version.
        DatasetSnapshots=[
            DatasetSnapshot(Name="quick-start", Version=1)
        ],
        ExternalVolumeMountIds=[],
    ),
    inputs={'first_value': int, 'second_value': int},
    outputs={'sum': int}
)
- When loading data from external data sources, it is recommended to start the flow with a task that makes a snapshot of the data (that is, a task that loads the data into Domino and copies it directly to an output). This ensures that changes in external data sources won’t affect the ability to reproduce a result.
- Always write results as explicit outputs for the task, rather than to an external data location.
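The last two practices can be illustrated with the kind of script a snapshot task might run: it copies the external data unchanged into the task's output location, so that later tasks read the frozen copy instead of the live source. This is a minimal sketch under stated assumptions. The function name snapshot_file and all paths are hypothetical, and temporary files stand in for both the external source and the location Domino designates for a declared task output.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def snapshot_file(source: Path, output: Path) -> str:
    """Copy source to output unchanged and return the SHA-256 of the copy."""
    output.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, output)
    return hashlib.sha256(output.read_bytes()).hexdigest()


# Temporary files stand in for the external source and the task output.
with tempfile.TemporaryDirectory() as tmp:
    external = Path(tmp) / "external" / "data.csv"
    external.parent.mkdir()
    external.write_text("id,value\n1,10\n")

    frozen = Path(tmp) / "outputs" / "raw_data"
    checksum = snapshot_file(external, frozen)

    # A later change to the external source does not affect the snapshot.
    external.write_text("id,value\n1,99\n")
    snapshot_unchanged = hashlib.sha256(frozen.read_bytes()).hexdigest() == checksum
```

Recording the checksum alongside the output is optional, but it gives downstream tasks a cheap way to confirm that they are reading the same bytes that were snapshotted.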