You can use the Domino Reproducibility Engine to track external data in Domino for reproducibility and auditability by persisting the data in Domino when your code runs. If your code connects to an external data source (e.g., a database, an API, a network drive, or a Domino Data Source), Domino does not know about the data coming from those sources.
To track external data for reproducibility reasons, you can materialize the in-memory data as a file so that the Domino Reproducibility Engine can track it. You can materialize it using any of the following options:
- A Domino Dataset
- A file such as a CSV or Parquet file in the Domino File System (DFS)
- A TrainingSet
Tip: For best performance, keep your result set under 10 GB if you are persisting files to the Domino File System.
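The idea of materializing in-memory data can be sketched end to end as follows. This is a minimal illustration, not Domino's own method: an in-memory SQLite database stands in for the external source, the helper name `snapshot_query` is hypothetical, and a temporary directory stands in for a tracked location so the code runs anywhere. It uses only the Python standard library; with pandas you would typically load the query result into a data frame and write that instead.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def snapshot_query(conn, sql, out_path):
    """Run `sql` on `conn` and write the full result set to a CSV file.

    Hypothetical helper for illustration. Returns the number of data rows
    written (the header row is not counted).
    """
    cursor = conn.execute(sql)
    header = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Stand-in for an external database; a real project would connect elsewhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("east", 10.5), ("west", 7.0)]
)

# Assumption: a temporary directory keeps the sketch portable. In Domino you
# would point this at a tracked location such as /mnt or a Dataset path.
snapshot_dir = Path(tempfile.mkdtemp())
snapshot_path = snapshot_dir / "sales_snapshot.csv"
rows_written = snapshot_query(
    conn, "SELECT region, amount FROM sales ORDER BY region", snapshot_path
)
conn.close()
```

The explicit `ORDER BY` keeps the snapshot's row order stable between runs, which makes later comparisons of two snapshots easier.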
Use Datasets to track external data
To persist a data frame as a file in a Domino Dataset, run the following Python code:
# Save data to the "quick-start" dataset
csv_file_path = "/domino/datasets/local/quick-start/data.csv"
parquet_file_path = "/domino/datasets/local/quick-start/data.parquet"
# Write the DataFrame to a CSV file
df.to_csv(csv_file_path, encoding='utf-8', index=False)
# Write the DataFrame to a binary parquet format
df.to_parquet(parquet_file_path)
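For auditability, it can help to record a checksum next to each persisted file so that you can later confirm the file is unchanged. The sketch below is one possible approach, not a Domino feature: the helper `write_manifest` and the `.manifest.json` naming are hypothetical, and a temporary directory stands in for the Dataset path.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_manifest(data_path, chunk_size=1024 * 1024):
    """Write a JSON sidecar with the size and SHA-256 of `data_path`.

    Hypothetical helper for illustration. The file is hashed in chunks so
    large files do not need to fit in memory.
    """
    data_path = Path(data_path)
    digest = hashlib.sha256()
    with data_path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    manifest = {
        "file": data_path.name,
        "bytes": data_path.stat().st_size,
        "sha256": digest.hexdigest(),
    }
    manifest_path = data_path.with_name(data_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


# Assumption: a temporary directory replaces the Dataset path used above.
demo_dir = Path(tempfile.mkdtemp())
demo_file = demo_dir / "data.csv"
demo_file.write_bytes(b"id,value\n1,a\n2,b\n")
manifest = write_manifest(demo_file)
```

Because the manifest sits in the same directory as the data file, it is persisted and versioned together with it.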
Store files in the DFS to track external data
To persist a data frame as a CSV or binary file in the Domino File System, run the following Python code in a notebook code cell, then sync the file changes to persist the file in the backing repository.
# Save data to the project's default working directory, /mnt. Sync afterward to persist the files as artifacts.
csv_file_path = "/mnt/data.csv"
parquet_file_path = "/mnt/data.parquet"
# Write the DataFrame to a CSV file
df.to_csv(csv_file_path, encoding='utf-8', index=False)
# Write the DataFrame to a binary parquet format
df.to_parquet(parquet_file_path)
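Because the guidance for the Domino File System is to keep the result set under 10 GB, you may want to check the size of the files before you sync. The following is a minimal sketch under stated assumptions: the helper name `check_dfs_guideline` is hypothetical, "10 GB" is interpreted as 10 * 1000**3 bytes, and the limit is applied to the combined size of the files passed in.

```python
import os
import tempfile

# Assumption: "10 GB" is read as decimal gigabytes and applied to the
# combined size of the files you are about to persist.
DFS_GUIDELINE_BYTES = 10 * 1000**3


def check_dfs_guideline(paths, limit_bytes=DFS_GUIDELINE_BYTES):
    """Return (ok, total_bytes) for the combined on-disk size of `paths`.

    `ok` is True only when the total is strictly under `limit_bytes`.
    """
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return total_bytes < limit_bytes, total_bytes


# Demo with a small temporary file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"x" * 2048)
    demo_path = tmp.name

ok, total_bytes = check_dfs_guideline([demo_path])
ok_with_small_limit, _ = check_dfs_guideline([demo_path], limit_bytes=1000)
os.remove(demo_path)
```

In a project you would pass the paths you just wrote, for example `["/mnt/data.csv", "/mnt/data.parquet"]`, and consider a Dataset instead when the check fails.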
Use TrainingSets to track external data
To save a data frame as a TrainingSet, run the following Python code.
from domino.training_sets import TrainingSetClient
training_set_version = TrainingSetClient.create_training_set_version(
    training_set_name="my-training-set",
    df=df
)
# To retrieve the data as a data frame:
# tsv_by_num = TrainingSetClient.get_training_set_version(
#     training_set_name="my-training-set",
#     number=1,
# )
# training_df = tsv_by_num.load_training_pandas()
For external data accessed using a Domino Data Source connector, you can also use the Data Source audit log to track Data Source activity.
Data Source logs provide a way to track user activity, which can be helpful for reproducibility and lineage. However, Domino cannot reproduce the state of your Data Source at the time of execution.
This section explains how to ensure workflow reproducibility in Domino:
- Reproducibility use cases: Learn how to reproduce the results of a Job, Workspace, Model, App, or Launcher.
- Selectively revert past materials: Selectively restore a part of a Project, such as the package library version, while keeping your latest code and data.
- File syncing and persistence: Domino tracks project files automatically, storing previous versions in the blob store.
- Remove a file from the DRE (permanent deletion): Purge a file completely and permanently from the blob store.
- Tips for reproducibility in Domino: Tips for maximizing the power of the Domino Reproducibility Engine.