Datasets Overview

Domino Datasets provide high-performance, versioned, and structured filesystem storage in Domino. With Domino Datasets, you can build several curated collections of data in one project, and share them with your fellow contributors across their projects.

A Domino Dataset is a collection of files that are available in user executions as a filesystem directory. A Dataset always reflects the most recent version of the data. You can modify the contents of a Dataset through the Domino UI or through workload executions at any time.

You can version the contents of a Domino Dataset by creating a Snapshot containing a read-only copy of the Dataset files at a given time. Snapshots are associated with the Dataset they version.

The following are the primary ways to interact with a Domino Dataset:

  • Work with Datasets local to your project

  • Read from a shared Dataset you have mounted to your project

Create local datasets

Domino Datasets belong to Domino projects. Permission to read from and write to a Dataset is granted to project contributors, in the same way as for project files. A Dataset that belongs to a project is considered to be local to that project.

By default, a project can have up to five datasets. See read-write datasets.

Create a new Dataset in your project:
  1. Click Data from the project menu, then click Create New Dataset.

  2. Enter a name and optional description, then click Create Dataset.

  3. Upload data in your browser. To preserve the filesystem structure of your uploads, you can drag and drop directories and subdirectories. Additionally, you can pause and resume the upload as needed.


Note

The browser upload is suitable for up to 50GB or 50,000 individual files. For larger uploads, Domino recommends that you use the Domino CLI for your upload. To do this, run the following command, adjusted for your dataset and file path:

domino upload-dataset <project_owner>/<project_name>/<dataset_name> <path-to-folder>

For information about how to install and configure the Domino CLI, see this topic.
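Before choosing between the browser upload and the CLI, it can help to measure the folder you plan to upload. The following is a minimal Python sketch and is not part of Domino: the function names are hypothetical, and reading "50GB" as 50 * 1024**3 bytes is an assumption.

```python
import os

# Limits taken from the note above. Interpreting "50GB" as 50 * 1024**3 bytes
# is an assumption; adjust it if your deployment documents a different limit.
MAX_BROWSER_BYTES = 50 * 1024**3
MAX_BROWSER_FILES = 50_000


def measure_folder(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.isfile(full_path):
                file_count += 1
                total_bytes += os.path.getsize(full_path)
    return file_count, total_bytes


def browser_upload_is_suitable(file_count, total_bytes):
    """True when both the file count and the size are within the browser limits."""
    return file_count <= MAX_BROWSER_FILES and total_bytes <= MAX_BROWSER_BYTES
```

If `browser_upload_is_suitable(*measure_folder("<path-to-folder>"))` returns `False`, the CLI command shown above is likely the better choice.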

You can modify the contents of a Dataset at any time. A simple way to do this is through the Domino Dataset page.


Use shared datasets

To access the contents of an existing Dataset that is not in your project, mount the target Dataset in your project.

Mount a Dataset:
  1. Click Datasets from the Project menu, then click Mount Shared Dataset.

  2. Click Dataset to Mount to see a list of Datasets to which you have access. Select the dataset that you want to mount in this project.

    To access a Dataset, you must be an Owner, Contributor, Project Importer, or Results Consumer on the project that contains the Dataset.

Under Shared Datasets you can see the Dataset that you mounted. The Path for the Dataset points to a directory where you will find the mounted Dataset in your project’s executions. When mounted this way, the Dataset as well as any associated snapshots are read-only.


You can remove a shared dataset at any point by selecting the Unmount action associated with that dataset on the Datasets page.


Create snapshots

To version the contents of a Dataset, you can create a Snapshot. Snapshots are read-only, immutable states of the dataset. You can create multiple snapshots, but cannot modify existing snapshots.

  1. From the Datasets page of your project, click the name of the Dataset you want to version to open its overview page.

  2. Click Take Snapshot.


    By default, you can create a Snapshot that will copy all files in the Dataset. Alternatively, you can select a subset of the files and folders to include in the Snapshot.


  3. When prompted, initiate the snapshot creation process.

    You can specify a tag that can be used to mount the snapshot under a friendly name in subsequent executions. You can see a preliminary estimate of how long the snapshot creation will take based on some basic heuristics. The estimate will be refined once the process is underway.


    Note

While a snapshot is in progress you can cancel it from the Dataset overview page. If you cancel a snapshot, any partial snapshot data will be automatically deleted.


Manage datasets and snapshots

From the Datasets page of your project, click the name of a dataset to open its Overview page. The page shows the Dataset name and description, along with buttons to rename the Dataset, mark it for deletion, upload files to it, or take a snapshot.

By default, the page shows the latest files and folders in the Dataset. If snapshots have been created, you can also select a particular snapshot to examine its contents.

For a snapshot, you can perform the following actions:

  • Add Tag - Tags create a friendly path when mounting a snapshot inside executions. A Dataset owner can move a tag between different snapshots to provide a stable path to whichever snapshot holds the desired state of the data.

    Note

    If more than one tag is used, the last added tag will be used for mounting purposes.

  • Mark for Deletion - When a snapshot is no longer needed, you can mark it for deletion. Such snapshots will no longer be mounted in subsequent executions. The Snapshot will be flagged to a Domino administrator as ready for deletion, but will not be fully deleted until the administrator takes an additional action to delete it.
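The tagging and deletion rules above can be sketched in Python. This is an illustrative model only, not Domino code: the function name and its parameters are hypothetical, and it assumes that a snapshot is mounted under its snapshot ID in addition to its most recently added tag.

```python
def snapshot_mount_names(snapshot_id, tags_in_order_added, marked_for_deletion=False):
    """Return the directory names under which one snapshot would be mounted.

    tags_in_order_added lists the snapshot's tags, oldest first.
    """
    # Snapshots marked for deletion are no longer mounted in executions.
    if marked_for_deletion:
        return []
    names = []
    # Only the last added tag is used for mounting.
    if tags_in_order_added:
        names.append(tags_in_order_added[-1])
    # Assumption: the numeric snapshot ID is always available as a mount name.
    names.append(str(snapshot_id))
    return names
```

For example, a snapshot with ID 1 that was tagged `old` and then `new` would be mounted as `new` and `1` under this model.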

If you no longer need the entire dataset, you can mark it for deletion. Similar to Snapshots, a Domino administrator must perform the final deletion. The primary difference is that marking a Dataset for deletion will remove not only the Dataset but also its associated snapshots.

Work with datasets

Datasets and associated Snapshots from a project are automatically available in Domino executions (Workspaces, Jobs, Apps, and Launchers) at a predefined path that follows the conventions described below. You no longer have to use a domino.yaml configuration file to control mounting behavior, as in previous Domino releases.

The following example configuration shows how Datasets and Snapshots translate into paths that are available in executions.

  • Dataset called clapton (local to the project)

    • Snapshot 1 (tagged with tag1)

    • Snapshot 2 (not tagged)

  • Dataset called mingus (local to project)

    • Snapshot 1 (tagged with tag2)

    • Snapshot 2 (not tagged)

  • Dataset called ella (shared from another project)

    • Snapshot 1 (tagged with tag3)

    • Snapshot 2 (not tagged)

  • Dataset called davis (shared from another project)

    • Snapshot 1 (tagged with tag4)

    • Snapshot 2 (not tagged)

Paths when using Git-based projects with CodeSync

For a Git-based project with CodeSync, the Datasets and Snapshots above will be available under the following hierarchy:

/mnt
   |--/data
     |--/clapton             <== R/W dataset
     |--/mingus              <== R/W dataset
     |--/snapshots           <== Snapshot folder organized by dataset
        |--/clapton          <== RO Snapshots for clapton dataset
           |--/tag1          <== Mounted under latest tag
           |--/1             <== Always mounted under the snapshot ID
           |--/2
        |--/mingus
           |--/tag2
           |--/1
           |--/2
   |--/imported
     |--/data
        |--/ella             <== RO shared dataset
        |--/davis            <== RO shared dataset
        |--/snapshots        <== Snapshot folder organized by dataset
           |--/ella          <== RO Snapshots for ella dataset
              |--/tag3       <== Mounted under latest tag
              |--/1          <== Always mounted under the snapshot ID
              |--/2
           |--/davis
              |--/tag4
              |--/1
              |--/2

The paths for all mounted Datasets and the root for any associated snapshots can always be seen in the Settings panel inside a Workspace or when launching an execution.
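As a minimal sketch of the hierarchy above, the following Python helpers build the expected paths for a Git-based project. The helper names are hypothetical and not part of any Domino library; the roots are copied from the tree, and the Settings panel remains the authoritative source for the actual paths in your deployment.

```python
from pathlib import PurePosixPath

# Roots copied from the hierarchy shown above for Git-based projects.
LOCAL_ROOT = PurePosixPath("/mnt/data")
SHARED_ROOT = PurePosixPath("/mnt/imported/data")


def dataset_path(name, shared=False):
    """Path of a read/write local dataset or a read-only shared dataset."""
    root = SHARED_ROOT if shared else LOCAL_ROOT
    return root / name


def snapshot_path(name, tag_or_id, shared=False):
    """Path of a read-only snapshot, addressed by its latest tag or snapshot ID."""
    root = SHARED_ROOT if shared else LOCAL_ROOT
    return root / "snapshots" / name / str(tag_or_id)
```

For example, `snapshot_path("ella", "tag3", shared=True)` corresponds to the `tag3` entry under the `ella` snapshots folder in the tree.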


Paths when using Domino File System projects

For a Domino File System-based project, the Datasets and Snapshots above will be available under the following hierarchy:

/domino
   |--/datasets
      |--/local               <== local datasets and snapshots
         |--/clapton          <== R/W dataset
         |--/mingus           <== R/W dataset
         |--/snapshots        <== Snapshot folder organized by dataset
            |--/clapton       <== RO Snapshots for clapton dataset
               |--/tag1       <== Mounted under latest tag
               |--/1          <== Always mounted under the snapshot ID
               |--/2
            |--/mingus
               |--/tag2
               |--/1
               |--/2
      |--/ella                <== RO shared dataset
      |--/davis               <== RO shared dataset
      |--/snapshots           <== Shared datasets snapshots organized by dataset
         |--/ella             <== RO Snapshots for ella dataset
            |--/tag3          <== Mounted under latest tag
            |--/1             <== Always mounted under the snapshot ID
            |--/2
         |--/davis
            |--/tag4
            |--/1
            |--/2

The paths for all mounted Datasets and the root for any associated snapshots can always be seen in the Settings panel inside a Workspace or when launching an execution.
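Because a mounted Dataset is an ordinary directory, its contents can be read from an execution with standard file operations. The following Python sketch lists the files in a Dataset of a Domino File System project. The function name is hypothetical, and the default root assumes the local Datasets location shown in the hierarchy above; confirm the actual path in the Settings panel.

```python
from pathlib import Path

# Assumed default: the root for local datasets in a Domino File System project.
DFS_LOCAL_ROOT = Path("/domino/datasets/local")


def list_dataset_files(dataset_name, root=DFS_LOCAL_ROOT):
    """Return sorted, POSIX-style relative paths of all files in a mounted dataset."""
    dataset_dir = Path(root) / dataset_name
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {dataset_dir}")
    return sorted(
        p.relative_to(dataset_dir).as_posix()
        for p in dataset_dir.rglob("*")
        if p.is_file()
    )
```

Inside an execution, `list_dataset_files("clapton")` would list the local Dataset from the example, and `list_dataset_files("ella", root="/domino/datasets")` would list the shared one.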


Upgrade from versions prior to Domino 4.5

Domino 4.5+ brings several improvements to datasets. If you just upgraded from a version prior to Domino 4.5, the following information might be of particular interest.

Summary of changes

  • Datasets are now always read/write and reflect the latest version of the files. You can freely manipulate the contents of a dataset from Domino Workspaces, Jobs, Apps, and Launchers.

  • You can optionally create read-only snapshots associated with a dataset. This is now an explicit action.

  • Datasets and associated Snapshots are automatically mounted for Domino executions. domino.yaml has been deprecated, and you no longer need to use it for dataset and snapshot mounting.

  • Scratch spaces, which were previously meant for convenient read/write iterations, are also deprecated and are replaced with a new default per-project dataset.

Migration considerations

Although these changes are significant, any datasets and snapshots created with a prior version of Domino will be migrated seamlessly according to the following rules:

  • Datasets that did not have any snapshots previously will automatically become read/write.

  • Datasets with one or more snapshots will have the most recent snapshot promoted to a dataset and will automatically become read/write.

  • domino.yaml in existing projects will be ignored and datasets and snapshots will be mounted in executions based on the mounting rules described previously.

  • Scratch spaces with data in them will be promoted to a dataset. The Domino username of the user who owned the scratch space will be used as the name of the Dataset. Scratch spaces that are empty at the time of upgrade will not be migrated.

  • A new dataset with the same name as the project will be automatically created.

Copyright © 2022 Domino Data Lab. All rights reserved.