How Domino handles large Datasets

When you start an execution, Domino copies your project files to the executor that hosts the execution. When the execution finishes, Domino by default writes all files in the working directory back to the project as a new revision. When you work with large volumes of data, this presents two potential problems:

  • The number of files written back to the project might exceed the configurable limit, which is 10,000 files by default.

  • The time required to write the files back is proportional to the size of the data, so large volumes of data can take a long time to sync.

The following table shows the recommended solutions for these problems. They are described in further detail after the table.

Case                           Data size      # of files   Static / Dynamic   Solution
Large volume of static data    Unlimited      Unlimited    Static             Domino Datasets
Large volume of dynamic data   Up to 300 GB   Unlimited    Dynamic            Project data compression

Domino Datasets

When working on image recognition or image classification deep learning projects, you often need a training Dataset of thousands of images, and the total Dataset size can easily reach tens of GB. For these types of projects, the initial training uses a static Dataset: the data is not constantly changed or updated. Furthermore, the data that is actually used is normally processed into a single large tensor.

Store your processed data in a Domino Dataset. Datasets can be mounted by other Domino projects, where they are attached as a read-only network filesystem to that project’s runs and workspaces.
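For example, a run in a project that mounts the Dataset can read the processed files directly from the mount path. The sketch below is illustrative only: the Dataset name training-images, the script train.py, and the mount path /domino/datasets/my-project/training-images are assumptions, and the actual mount path depends on your Domino version and configuration.

# Minimal sketch, assuming a read-only Dataset mounted at /domino/datasets/my-project/training-images
# List the processed files in the mounted Dataset:
ls /domino/datasets/my-project/training-images/

# Pass the processed tensor file to a (hypothetical) training script:
python train.py --data /domino/datasets/my-project/training-images/train_tensor.npy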

Project data compression

Sometimes, you must work with logs as raw text files. Typically, new log files are constantly being updated, so your Dataset is dynamic. You might encounter both problems described previously at the same time:

  1. The number of files is over the 10,000-file limit.

  2. Preparing and syncing the data takes a long time.

Domino recommends that you store these files in a compressed format. If you need the files to be in an uncompressed format during your Run, you can use Domino Compute Environments to prepare the files. In the pre-run script, you can uncompress your files:

tar -xvzf many_files_compressed.tar.gz

Then in the post-run script, you can re-compress the directory:

tar -cvzf many_files_compressed.tar.gz /path/to/directory-or-file
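As a fuller sketch, the pre-run and post-run scripts can guard against a missing archive and scope the compression step to a single data directory. The archive name and the data/ directory below are illustrative assumptions, not required names.

# Pre-run script (sketch): extract the archive only if it exists.
if [ -f many_files_compressed.tar.gz ]; then
    tar -xvzf many_files_compressed.tar.gz
fi

# Post-run script (sketch): re-compress only the data directory before the sync.
tar -cvzf many_files_compressed.tar.gz data/
# Optionally remove the extracted files so only the archive is written back to the project:
rm -rf data/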

If your compressed file is still large, the time to prepare and sync it might still be long. Consider storing these files in a Domino Dataset to minimize the time to copy.

Dataset quotas and limits

Admins can set quotas and limits on Dataset storage in configuration records. Contact your admin to configure quotas.

As you approach your quota limit, you receive notifications, emails, and warnings on Dataset pages.

Your Dataset storage usage is the sum of the sizes of all active Datasets that you own. If a Dataset has multiple owners, the size of that Dataset counts towards the storage usage of each owner. For more information, see Dataset Roles.