Learn how to make the most of the Domino Reproducibility Engine (DRE) with these tips. Because the DRE operates at a low level of abstraction, tracking things like Docker images and files, it offers a lot of flexibility and power.
For results with the strongest reproducibility requirements (for example, for regulatory or compliance reasons), produce them by executing Jobs rather than through Workspaces. Because a Workspace is interactive, it’s impossible to make every aspect of its state reproducible (for example, a data scientist can execute cells in a different order). Jobs are headless: they run from start to finish without interaction, which makes them better suited to strong reproducibility.
Use explicit version numbers when installing packages in Python or R, and consolidate package installation in one place.
It’s best to put these in your compute Environment definition. Remember, however, that if Domino detects a requirements.txt file in your Project files, it will pip install (or conda install) the requirements file before it executes your code. An analogous pattern that some R practitioners use is to have a single install.R file in their Project, which they run before anything else.
If your package definitions are in a file like this, note that your software state is spread across both your compute Environment and your Project files. However, you may find this advantageous in some circumstances because you can use Domino’s comparison feature to compare two versions of requirements.txt or install.R to see exactly what changed.
Regardless, use explicit package versions, e.g., pip install torch==1.12.1 rather than pip install torch. That way, you’ll get the same results when you run the code in the future, even after newer versions of the package are released.
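One way to enforce explicit versions is to scan a requirements file for entries that are not pinned with ==. The following is a minimal sketch, not a Domino feature: the function name find_unpinned and the sample file contents are hypothetical, and the pattern only handles simple name==version lines (it will also flag URL-based requirements as unpinned).

```python
import re

# Matches a requirement pinned to one exact version, e.g. "torch==1.12.1".
# Wildcards such as "torch==1.*" are deliberately not treated as pinned.
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s*;]+\s*(;.*)?$"
)


def find_unpinned(requirements_text):
    """Return the requirement lines that do not pin an exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as -r or -e
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical requirements.txt contents, used only for illustration.
example = """\
torch==1.12.1
pandas
numpy>=1.21  # a range, not a pin
"""

print(find_unpinned(example))  # prints ['pandas', 'numpy>=1.21']
```

A check like this could run at the start of a Job so that an unpinned entry is caught before results are produced.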
For instructions on installing specific versions of R packages, see this Stack Overflow post.
Because Domino keeps your files revisioned automatically, it’s best if you don’t encode your own versioning scheme in file names. For example, you don’t need to append “v1”, “v2”, “final”, etc. to your file names. In fact, it’s better if you don’t, because Domino treats files with the same name as successive versions of one file and shows what changed when it presents comparisons between Jobs.
Because the DRE tracks all files that a Job produces, there are many clever tricks you can use to track things in files. For example:
- If you are using random numbers in your code, you can persist the seed value to a file, so that you can recreate it later if you need to reproduce the precise random behavior. In Python, you can get the state of the random generator and pickle it to a file. In R you can use set.seed to set a seed value that you choose yourself.
- Generalizing the above: R and Python both have powerful capabilities for persisting memory state to files. In R you can save individual objects, or even your entire set of variables. Python lets you pickle objects. You can use these techniques to easily create revisioned artifacts of intermediate values you’ve calculated. This may be useful to save time later, by resuming a computationally intensive process from an intermediate state instead of starting from scratch.
- You can create your own domain-specific formats with customized results you may want to track and compare over time. For example, you can create a “model summary.pdf” that has the same format between different Jobs.
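The Python side of the first two tips can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names (save_object, load_object, save_rng_state, load_rng_state) and the file name rng_state.pkl are hypothetical, and the code covers only Python's built-in random module. Libraries such as NumPy or PyTorch keep their own generator state, which would need to be saved separately. The demo writes to a temporary directory so it is self-contained; in a Job you would presumably write to a location that Domino tracks.

```python
import pickle
import random
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Pickle any picklable object to a file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load an object previously written by save_object."""
    with open(path, "rb") as f:
        return pickle.load(f)


def save_rng_state(path):
    """Persist the current state of Python's built-in random generator."""
    save_object(random.getstate(), path)


def load_rng_state(path):
    """Restore a generator state written by save_rng_state."""
    random.setstate(load_object(path))


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "rng_state.pkl"  # hypothetical file name

    # Save the state before drawing any numbers.
    save_rng_state(state_file)
    first_run = [random.random() for _ in range(3)]

    # Restoring the saved state replays exactly the same sequence.
    load_rng_state(state_file)
    second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # prints True
```

The same save_object and load_object helpers can store intermediate results, so a long computation can resume from a saved value instead of starting from scratch. Only unpickle files that you trust, because loading a pickle can execute arbitrary code.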
The topics in this section explain how you can make your workflows reproducible in Domino.
- Reproducibility use cases
  Learn how to reproduce the results of a Job, Workspace, Model, App, or Launcher.
- Selectively revert past materials
  Selectively restore a part of a Project, such as the package library version, while keeping your latest code and data.
- File syncing and persistence
  Domino automatically tracks files in your Project and keeps previous versions in the blob store.
- Remove a file from the DRE: Permanent deletion
  Purge a file completely and permanently from the blob store.
- Track external data
  Materialize external data as a file in Domino to benefit from the automatic tracking that Domino provides.