Step 8: Working with Domino Datasets
If your Domino project uses a very large number of files (e.g., more than 10,000), or a single file larger than 8GB, consider using a Domino dataset.
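As a rough way to check these thresholds from a MATLAB workspace, the sketch below counts the files under a folder and looks for any single file above the size limit. The function name needsDataset is hypothetical, and reading 8GB as 8 * 1024^3 bytes is an assumption; neither comes from Domino itself.

```matlab
function tf = needsDataset(projectDir)
% needsDataset  Hypothetical helper: true if projectDir exceeds the
% thresholds mentioned above (more than 10,000 files, or one file > 8GB).
    maxFileCount = 10000;        % file-count threshold from the text
    maxFileBytes = 8 * 1024^3;   % assumption: 8GB interpreted as 8 GiB

    % Recursively list everything under projectDir (needs R2016b or later)
    entries = dir(fullfile(projectDir, '**', '*'));
    files = entries(~[entries.isdir]);   % drop folders, including . and ..

    tooManyFiles = numel(files) > maxFileCount;
    hasHugeFile  = any([files.bytes] > maxFileBytes);
    tf = tooManyFiles || hasHugeFile;
end
```

Saved as needsDataset.m, it could be called as needsDataset('/mnt') to inspect the project's working directory.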
To summarize the lifecycle of a dataset, consider this workflow:
- Datasets are defined in a .yaml file, along with input folders and output folders.
- A newly defined dataset will live in the input folder specified in the .yaml file. By default, the dataset in the input folder is read-only, while files in the output folder are writable.
- If you do not write anything to the output folder, the dataset will remain unchanged (as-is).
- Any files that you’d like to persist from the dataset in the input folder must be copied to the output folder.
- If you write to the output folder, the dataset files will be overwritten. Datasets, however, are saved as snapshots, allowing you to roll back to a previous snapshot of the dataset if needed.
This step will provide an example of how to use a dataset with our weather project.
Step 8.1: Add a dataset to your project
- Navigate to the Data section of your project.
- Click Create New Dataset.
- Enter a name (e.g., “get-started-MATLAB-dataset”) and description for your dataset, then click Upload Contents.
- Next, we’ll create a dataset by creating an initial snapshot (the initial version of your dataset). For our project, we’ll create the dataset using a MATLAB workspace. Click + New Workspace in the “Create with Interactive Workspace” option. The “Launch New Workspace” modal should appear.
- Select MATLAB as your workspace IDE. Click Launch. Your MATLAB workspace will launch with a new folder used to store the data that is part of your dataset.
- Locate this new folder by first clicking the “/” in the file path of your MATLAB workspace. Next, navigate to the dataset folder that Domino created for you. The folder path should be: /domino/datasets/domino/datasets/output/get-started-MATLAB-dataset. Let’s put some data in that folder.
- To populate the dataset, we will download a series of weather station files from the same NOAA repository we used earlier in the project. In our usual work directory (/mnt), create a file called downloadToDatasetDir.m. In it, we’ll create a function to download the NOAA data:
    function downloadToDatasetDir()
        % NOAA data URL
        baseUrlString = "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/";
        % Prefix shared by weather stations in Argentina
        baseWeatherStationId = 'AR0000000';
        % the location to save the files – the dataset output directory datasetFolder
        datasetFolder = "/domino/datasets/domino/datasets/output/get-started-MATLAB-dataset/";
        % There are 16 weather station files. We will iterate and download each one
        for counter=1:16
            if counter<10
                weatherStationId = sprintf('%s%s%d', baseWeatherStationId, '0', counter);
            else
                weatherStationId = sprintf('%s%d', baseWeatherStationId, counter )
            end
            urlString = sprintf("%s%s%s", baseUrlString, weatherStationId, ".csv")
            savedFileName = sprintf("%s%s%s", datasetFolder, weatherStationId, ".csv");
            websave(savedFileName, urlString);
        end
    end
- Save the file, then run it from the Command Window in your MATLAB workspace. The output should look something like this as each file is downloaded.
- Now let’s save the files we downloaded to Domino. Stop your workspace by clicking the Stop button. The “Uncommitted Changes” modal will appear. Enter a commit message and click Stop and Commit. Note that the files we wrote to the dataset location will be saved as well.
- Domino will display confirmation messages. Proceed to shut down your workspace. Close the browser tab and return to the Domino project. Return to the Data section and notice that the dataset is now available to you.
- Click the dataset. You should see the list of files we just downloaded. They are now part of the dataset.
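The download function in this step builds each station ID with an if/else on the counter. As a sketch of an alternative, the same zero-padding can be written with the %02d format specifier, and the URL and target path for each file can be computed before anything is downloaded. The function name buildStationDownloads and its parameters are hypothetical, introduced only for illustration.

```matlab
function downloads = buildStationDownloads(baseUrlString, baseWeatherStationId, datasetFolder, stationCount)
% buildStationDownloads  Hypothetical helper: list the URL and target path
% for each station file, without downloading anything.
    urlStrings = strings(stationCount, 1);
    savedFileNames = strings(stationCount, 1);
    for counter = 1:stationCount
        % %02d zero-pads the counter, so 1 becomes "01" and 12 stays "12"
        weatherStationId = sprintf('%s%02d', baseWeatherStationId, counter);
        urlStrings(counter) = sprintf("%s%s.csv", baseUrlString, weatherStationId);
        savedFileNames(counter) = sprintf("%s%s.csv", datasetFolder, weatherStationId);
    end
    % One row per station file: where to fetch it and where to save it
    downloads = table(urlStrings, savedFileNames);
end
```

Called with the base URL, 'AR0000000', the dataset output folder (with a trailing slash), and 16, this returns one row per file; each row can then be passed to websave(downloads.savedFileNames(row), downloads.urlStrings(row)) inside a loop.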
Step 8.2: Updating and writing to a dataset
The dataset you’ve created is now available in a read-only state. If you want to modify the dataset, or if your MATLAB algorithm will produce a very large number of files, you’ll need to create a new snapshot.
All new snapshots start out as empty and read-only. To create a snapshot that persists the previous snapshot’s data, you’ll need to copy the previous snapshot’s data into the new snapshot. To do this, we’ll need to create a file called domino.yaml.
- To create domino.yaml, navigate to the Files section of your project.
- Click the new file icon.
- In the file editor that appears, enter domino.yaml as the file name, and paste the following into the file’s body. Click Save.
    datasetConfigurations:
      - name: "AppendRaw"
        inputs:
          - path: "raw-input"
            dataset: "get-started-MATLAB-dataset"
        outputs:
          - path: "raw-output"
            dataset: "get-started-MATLAB-dataset"
- The other change we need to make is to use domino.yaml when starting a new MATLAB workspace. Click the Quick Action button.
- Click Continue under the Launch a Workspace option.
- In the “Launch New Workspace” modal, select MATLAB as the workspace IDE and expand the “Data” section.
- Click the “Advanced” tab under “Data Configuration”.
- In the “Advanced” tab, click the “Configuration” drop-down list and select the AppendRaw configuration from the domino.yaml file.
- Click Launch to start the workspace. Startup time may be longer due to Domino’s need to mount the additional dataset.
- After the workspace launches, you’ll be able to find your dataset under /domino/datasets/raw-input. Note that the current dataset snapshot is located in the input folder we specified in the domino.yaml file. The files in this folder are read-only and represent the last snapshot that was saved.
The /raw-output folder will hold the next snapshot of your dataset. If you leave it as-is and never write to it, it will keep the current snapshot of the dataset in place. However, if you write new files and commit them to Domino, they will overwrite and replace the current dataset.
- To demonstrate how to overwrite and update a dataset, let’s copy the existing dataset to the output folder, and then add a few new files. In the Command Window of your MATLAB workspace, enter the following line:
    ! cp -rf /domino/datasets/raw-input/* /domino/datasets/raw-output/
Note that this line starts with ! to denote that it is a system command. You will notice that the input dataset is now present in the output dataset folder.
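The same copy can also be sketched with MATLAB's built-in copyfile rather than a shell command. The function name persistInputSnapshot is hypothetical; the two folders are assumed to be the input and output paths mounted from domino.yaml.

```matlab
function outputFileCount = persistInputSnapshot(inputDir, outputDir)
% persistInputSnapshot  Hypothetical helper: copy the contents of the
% read-only input folder into the writable output folder.
    if ~isfolder(inputDir)
        error('persistInputSnapshot:missingInput', ...
              'Input folder not found: %s', inputDir);
    end

    % Copy the contents of inputDir into outputDir; 'f' forces the copy
    % even when a destination file is read-only
    [ok, message] = copyfile(inputDir, outputDir, 'f');
    if ~ok
        error('persistInputSnapshot:copyFailed', 'Copy failed: %s', message);
    end

    % Report how many top-level files are now in the output folder
    entries = dir(outputDir);
    outputFileCount = nnz(~[entries.isdir]);
end
```

For this project the call would be persistInputSnapshot('/domino/datasets/raw-input', '/domino/datasets/raw-output'). Raising an error on failure, rather than returning silently, makes a partial copy harder to miss before the snapshot is committed.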
- Let’s update our code from the previous section and download a new set of weather data files into the /raw-output folder. To start, change the directory back to /mnt by typing the following into the Command Window:
    cd /mnt
- Open the file downloadToDatasetDir.m and save it as downloadToDatasetDir2.m in the same directory. Make the following changes to the file (different stations):
    function downloadToDatasetDir2()
        baseUrlString = "https://www.ncei.noaa.gov/data/global-historical-climatology-network-daily/access/";
        baseWeatherStationId = 'ASN000050cd'; % different stations
        datasetFolder = "/domino/datasets/raw-output/";
        for counter=1:16
            if counter<10
                weatherStationId = sprintf('%s%s%d', baseWeatherStationId, '0', counter);
            else
                weatherStationId = sprintf('%s%d', baseWeatherStationId, counter )
            end
            urlString = sprintf("%s%s%s", baseUrlString, weatherStationId, ".csv")
            savedFileName = sprintf("%s%s%s", datasetFolder, weatherStationId, ".csv");
            websave(savedFileName, urlString);
        end
    end
- Run the file by clicking the Run button in MATLAB. The files will download to the /raw-output directory.
- Once the files are done downloading, click the Stop button in your Domino workspace. Domino will present the notification below. As you can see, Domino will save the output dataset if we stop and commit changes. Click Stop and Commit to end the session and save the new dataset snapshot.
- To verify the files were saved, navigate to the Data section in your Domino project.
- Click the dataset we just updated. The newly downloaded files will appear in the dataset listing.
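Because files that are not copied to the output folder do not carry over to the next snapshot, a check like the one below, run before clicking Stop and Commit, can list input files that are missing from the output folder. It is a minimal sketch that compares top-level file names only; the function names listUnpersistedFiles and topLevelFileNames are hypothetical.

```matlab
function unpersisted = listUnpersistedFiles(inputDir, outputDir)
% listUnpersistedFiles  Hypothetical helper: names of files present in
% inputDir but absent from outputDir (top level only, compared by name).
    inputNames  = topLevelFileNames(inputDir);
    outputNames = topLevelFileNames(outputDir);
    % setdiff returns the names found only in the input folder, sorted
    unpersisted = setdiff(inputNames, outputNames);
end

function names = topLevelFileNames(folder)
% Return a cell array with the names of the files directly inside folder
    entries = dir(folder);
    entries = entries(~[entries.isdir]);   % skip subfolders, . and ..
    names = {entries.name};
end
```

With the folders from this step, listUnpersistedFiles('/domino/datasets/raw-input', '/domino/datasets/raw-output') returns an empty cell array when every input file has a counterpart in the output folder.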