Datasets scratch spaces




Overview

A Datasets Scratch Space is a scalable, mutable (i.e. read-writeable) filesystem directory for temporary data storage and exploration. Scratch Spaces are a complement to the core Datasets functionality. They provide a place to keep intermediate results or candidates for Dataset Snapshots as you explore your data, and they are designed for when you don’t know what you want just yet (i.e. you don’t know what you don’t know). A Scratch Space is automatically mounted in Workspace sessions for each User in every Project (i.e. Scratch Spaces are per User, per Project). Here are some key properties:

  1. They do not preserve reproducibility.  Files placed in a Datasets Scratch Space are not versioned or tracked.  A Datasets Scratch Space is simply a long-lived directory with the scalable properties of Datasets (i.e. large file sizes and many individual files).
  2. They are only available for Workspaces.
  3. You get a unique Datasets Scratch Space per User per Project.
  4. If you shut down and relaunch Workspaces, the Datasets Scratch Space is exactly as you left it.  All the contents remain unless you promote them to a Dataset Snapshot.
  5. If you spin up multiple, concurrent Workspaces in a Project, all those Workspaces will see the same Datasets Scratch Space.  Remember, only you can “see” your Scratch Space, so any potential file locks are caused by you.
  6. When no Workspaces are running, the contents of the Datasets Scratch Space can be promoted to a Snapshot of a Dataset within the Project.



Finding my Scratch Space

For a given project (with name project-name), your Datasets Scratch Space for that Project will be at: /domino/datasets/{username}/{project-name}/scratch, where username is your Domino username. Remember, you only get Scratch Spaces with Workspaces.

Examples

Let’s say for these examples, my username is dara-data.

  1. Project: big-data, Owner: alex-algo

    My Scratch Space is at:

    /domino/datasets/dara-data/big-data/scratch

    Notice that it doesn’t matter that the owner is alex-algo. The scratch path still uses my username. This assumes I have the appropriate permissions to the Project.

  2. Project: big-data, Owner: dara-data

    My Scratch Space is at:

    /domino/datasets/dara-data/big-data/scratch

    Notice that the path is the same as in the first example, even though these are two different Projects named big-data. They won’t conflict because the path is scoped to Workspaces in that Project.
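The path convention above can be sketched as a small helper. This is only an illustration: the function name scratch_space_path is hypothetical and not part of any Domino API; it simply assembles the documented pattern /domino/datasets/{username}/{project-name}/scratch.

```python
from pathlib import PurePosixPath


def scratch_space_path(username: str, project_name: str) -> PurePosixPath:
    """Build the documented Scratch Space path for a user in a project.

    Hypothetical helper: it only formats the path pattern described above
    and does not check that the directory exists.
    """
    if not username or not project_name:
        raise ValueError("username and project_name must be non-empty")
    return PurePosixPath("/domino/datasets") / username / project_name / "scratch"


# The project owner does not appear in the path, only the current user.
print(scratch_space_path("dara-data", "big-data"))
```

Because the owner is not an input, both examples above produce the same path for dara-data.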




Seeing the contents of my Datasets Scratch Space

You can always view the contents of your Datasets Scratch Space by launching a Workspace and navigating to your Scratch Space. It is simply another directory (with all the high performance properties of Datasets).

  1. File Browser.

    If you’d like to get an idea of the contents of the Scratch Space without using a Workspace, you can navigate to the Datasets Project Level Page.

    Datasets Scratch Space File Browser

    There you will find a file browser that displays the contents of your Scratch Space. If you have lots of files, you can paginate through. You can also drill down into any folders that are in your Scratch Space. As you modify the contents of your Scratch Space (e.g. add/delete/edit files, add/delete/edit folders), refreshing the file browser will reflect those changes.

    file-pagination-cropped.png

  2. Calculated Size.

    The used Scratch Space size is calculated anytime a Workspace is stopped. The timestamp of the calculation is also provided.

    In this example, I have two Workspaces open, and my current calculation is 10 GB. This actually doesn’t reflect the 12 GB in the transactions folder that I created since the last time a Workspace was closed.

    Outdated Calculated Size

    When I close one of the Workspaces, the used Scratch Space size is updated to 22 GB.

    Updated Calculated Size
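Because the displayed size is only refreshed when a Workspace stops, you may want to check the current size yourself from inside a running Workspace. The following is a minimal sketch using only the Python standard library; the function name directory_size_bytes is hypothetical, and the result may differ slightly from the value the platform reports, since the platform's exact calculation method is not described here.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count link targets
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file may have been removed by another Workspace
                # between listing and measuring; ignore it.
                pass
    return total
```

For example, inside a Workspace you could call directory_size_bytes("/domino/datasets/dara-data/big-data/scratch") and divide by 1e9 for an approximate size in GB. On a Scratch Space with millions of files this walk can take a while.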




Promoting Scratch Space Contents to a Dataset Snapshot

The contents of your Scratch Space can be made into a Dataset Snapshot.

  1. No Workspaces can be running when you create a Dataset Snapshot from the contents of your Scratch Space. The Create a Dataset Snapshot from Scratch Space button is only enabled when no Workspaces are running.
  2. You will only be able to promote to a Dataset within your Project.
  3. Once the contents of the Scratch Space are made into a Dataset Snapshot, the contents of the Scratch Space will be deleted (i.e. the Scratch Space is cleared).

Workflow

  1. Click Create a Dataset Snapshot from Scratch Space

    file-browser-cropped.png

  2. Using the input box in the modal, select a Dataset in your Project.

    1. WARNING: When you create a Dataset Snapshot from the contents of the Datasets Scratch Space, those contents will no longer be in the Datasets Scratch Space (i.e. the contents of the Datasets Scratch Space will be deleted upon promotion to a Dataset Snapshot).
    2. Click Create Snapshot

    snapshot-promotion-cropped.png

  3. To confirm, you can navigate to the details page of your newly created Snapshot.

    snapshot-from-promotion-details-cropped.png




Risk Notifications

Using a Datasets Scratch Space for indefinite storage is discouraged. Not only can this lead to wasteful storage consumption and costs, it may also unnecessarily compromise reproducibility. Finally, while a Datasets Scratch Space is reliable persistent storage, data can still be lost if a User accidentally deletes contents from the Scratch Space. Creating a Snapshot of valuable work from the Scratch Space prevents accidental loss.

To mitigate the risk of using Scratch Spaces for indefinite storage, a user interface indicator provides a Scratch Space risk notification that alerts the user to the risk associated with the contents of the Scratch Space. Specifically, it notifies the user of how long potentially new work has gone without being made into a Snapshot. Here, the length of time is a proxy for the risk profile.

By default, the three ranges are:

  1. Low Risk: Less than or equal to five days.
  2. Medium Risk: Greater than five days and less than or equal to ten days.
  3. High Risk: Greater than ten days.

There are three risk ranges, separated by two thresholds; hence, the thresholds define the ranges. The risk ranges are in terms of days and have a lower bound of zero days and no upper bound. The thresholds are configurable via two central configuration values.


Namespace: common
Key: com.cerebro.domino.dataset.scratch.riskThresholdOneInDays
Value: number
Default: 5

This option controls the first Datasets Scratch Space risk threshold in days.

Namespace: common
Key: com.cerebro.domino.dataset.scratch.riskThresholdTwoInDays
Value: number
Default: 10

This option controls the second Datasets Scratch Space risk threshold in days.
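The mapping from a non-empty duration to a risk range can be sketched as follows. This is an illustration of the ranges listed above, not the platform's actual code: the function name risk_level and its parameter names are hypothetical, and the defaults mirror the documented default thresholds of 5 and 10 days.

```python
def risk_level(non_empty_days: float,
               threshold_one: float = 5,
               threshold_two: float = 10) -> str:
    """Classify a non-empty duration (in days) into Low, Medium, or High risk.

    Both thresholds are inclusive upper bounds, matching the ranges above:
    Low is <= threshold_one, Medium is <= threshold_two, High is anything more.
    """
    if non_empty_days < 0:
        raise ValueError("non_empty_days must be zero or greater")
    if threshold_one > threshold_two:
        raise ValueError("threshold_one must not exceed threshold_two")
    if non_empty_days <= threshold_one:
        return "Low"
    if non_empty_days <= threshold_two:
        return "Medium"
    return "High"


print(risk_level(3))    # within the first range
print(risk_level(7))    # between the two thresholds
print(risk_level(12))   # beyond the second threshold
```

Changing the two configuration values corresponds to passing different threshold_one and threshold_two arguments in this sketch.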


Administration

The administration page for Dataset Scratch Spaces is the same as the Datasets administration page. Once in the Administration area, you can navigate to it by selecting Datasets under the Advanced menu option.

Datasets Admin Nav

A Dataset Scratch Spaces administration section is below the Datasets administration section. A table of all Dataset Scratch Spaces is shown.

Datasets Scratch Space Admin

  1. Project

    Unique identifier which is a concatenation of:

    {owner_username}/{project_name}

  2. User

    Full name of user.

  3. Size

    Used Scratch Space Size.

  4. Last Time Size Calculated

    Timestamp of when the value in the Size column was calculated.

  5. Last Time Snapshot

    The last time the contents of that Scratch Space were promoted into a Dataset Snapshot. No value means the contents of the Scratch Space have never been made into a Dataset Snapshot.


Filtering

A filtering input box is provided at the top of the administration table. The filter will do a case insensitive string match on all the columns in the administration table and return a table with rows with matching elements (i.e. rows that contain the filtering string).

Datasets Scratch Space Admin Filtering
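The filtering behavior described above (a case-insensitive substring match across every column) can be sketched like this. The function name filter_rows and the dictionary-based row representation are assumptions made for illustration; the actual table implementation is not documented here.

```python
from __future__ import annotations


def filter_rows(rows: list[dict[str, str]], query: str) -> list[dict[str, str]]:
    """Return the rows in which any column value contains query, ignoring case."""
    needle = query.casefold()
    return [
        row
        for row in rows
        if any(needle in str(value).casefold() for value in row.values())
    ]


# Hypothetical table rows using two of the column names from the table above.
rows = [
    {"Project": "alex-algo/big-data", "User": "Dara Data"},
    {"Project": "john_smith/datascience", "User": "John Smith"},
]
print(filter_rows(rows, "BIG"))
```

An empty query matches every row, since the empty string is contained in every value.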


Deleting Scratch Space Contents

Administrators can delete the contents of a Datasets Scratch Space. Only Scratch Spaces with files are eligible to be deleted; this is the case even if the files are zero bytes. The Delete Contents button for Scratch Spaces that contain no files will be disabled.


Workflow

  1. Click Delete Contents for the Datasets Scratch Space you want to delete the contents of.

    delete-contents-cropped.png

  2. This will bring up a confirmation modal. If you are sure you want to delete the contents, press Delete Contents. Deleting the contents of a Scratch Space cannot be undone.

    delete-contents-confirm-cropped.png

  3. The administration table will be refreshed and you can confirm that the contents of your selected Datasets Scratch Space are deleted.

    datasets-scratch-space-admin-deleted-cropped.png




Non-Empty Length of Time

The length of time used to notify the user is specifically the time the Scratch Space has been non-empty. Recall the following:

  1. The storage size of the Datasets Scratch Space contents is computed after the stopping of any Workspace.
  2. There are three cases in which the Scratch Space is empty.
    1. Initial state
    2. User clears it (e.g. rm -rf *)
    3. Promote to Snapshot

Consider the following figure, which illustrates the state of a Datasets Scratch Space over time. The horizontal axis is time. Portions where the Datasets Scratch Space is non-empty are shown in orange. A non-empty Scratch Space is one where there exists at least one file (of any size). At times t_2 and t_4, the Scratch Space is cleared (emptied) by the User and through a promotion to a Snapshot, respectively.

Non-Empty Time

Assume we are at the moment in time marked “Now” and we’ve closed a Workspace, and hence, the Scratch Space storage size is computed. The non-empty time we report spans from t_5, the most recent point at which the Scratch Space storage was calculated to be non-empty after having been empty, to that moment. This non-empty time is specified as T_NON-EMPTY.
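One way to picture the T_NON-EMPTY calculation is as a scan over the history of size calculations. The sketch below assumes a chronological list of (timestamp, is_non_empty) observations, one per Workspace stop; this data structure and the function name non_empty_duration are hypothetical and only illustrate the rule described above.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def non_empty_duration(observations: Sequence[Tuple[datetime, bool]],
                       now: datetime) -> timedelta:
    """Time from the most recent empty-to-non-empty transition until now.

    observations must be in chronological order. The Scratch Space is
    treated as empty before the first observation (its initial state).
    Returns zero if the most recent observation is empty.
    """
    start: Optional[datetime] = None
    for timestamp, is_non_empty in observations:
        if is_non_empty:
            if start is None:
                start = timestamp  # first non-empty calculation after empty
        else:
            start = None  # cleared by the User or by a promotion
    return now - start if start is not None else timedelta(0)


# Illustrative timeline (not the exact figure): emptied on day 3 and day 6,
# then calculated as non-empty again on day 8.
day = lambda n: datetime(2024, 1, n)
history = [(day(1), True), (day(3), False), (day(4), True),
           (day(6), False), (day(8), True), (day(9), True)]
print(non_empty_duration(history, now=day(12)))
```

Later non-empty observations (day 9 in this timeline) do not reset the start point; only a return to the empty state does.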




FAQ

  1. I tried to write to a Datasets Scratch Space in a Job and it failed. Why?
    A Datasets Scratch Space is only available in Workspace sessions.
  2. Why is the stated “Used Scratch Space Size” in the Datasets Scratch Space file browser different from what I expected based on the files that are currently in my Datasets Scratch Space?
    The “Used Scratch Space Size” (size calculation) is not updated in real time. Instead, it is calculated every time a Workspace is closed. Notice that every stated “Used Scratch Space Size” is accompanied by a note stating when the value was last calculated. See “Why is the size calculation in the file browser only updated when a Workspace is closed?”.
  3. Why is the size calculation in the file browser only updated when a Workspace is closed?

    Having a process periodically calculate the filesystem size for open Workspaces can be taxing on the system. Remember, the Scratch Space is designed to have all the performance properties of Datasets: it can handle large data (i.e. TBs) and a large number of individual files (i.e. millions of files). Also, if Workspaces are already open, Users can inspect the filesystem and sizes directly. Finally, the file browser is most convenient when no Workspaces are running; at that point, everything is static and up-to-date.

  4. Why is the “Create a Dataset Snapshot from Scratch Space” button disabled sometimes?

    You can only snapshot a Datasets Scratch Space when no Workspaces are running.

  5. Why are the contents of the Datasets Scratch Space deleted upon promotion to a Dataset Snapshot?

    This is for performance reasons. Like Datasets, Scratch Spaces can potentially contain large files and a large number of individual files. To create a Dataset Snapshot and preserve the contents of the Scratch Space, we would need to perform an expensive copy operation (i.e. one that takes a long time and significant computation resources). By allowing the Scratch Space to be cleared upon promoting its contents to a Dataset Snapshot, we are able to perform the operation instantly. Both the Scratch Space and the newly created Dataset Snapshot are available for use after promotion.

  6. I want to promote my Scratch Space to a Dataset that doesn’t exist yet. How do I create a Dataset?

    You can create an empty Dataset as described in the Creating Datasets Support article.

  7. What if I want to take the contents of my Dataset Scratch Space and make it into a Snapshot of a Dataset not in my current Project?

    You would have to do this manually, treating the Scratch Space like any directory you’d like to copy or move into a new Snapshot directory.

    First, you must have access to create a new Snapshot in that Project (see Sharing and Collaboration).

    Assuming you have permission, you would need to create a configuration in your domino.yaml file where you mount an output directory for your desired Dataset. Because the Dataset is not in your Project, you will have to refer to the Dataset in a fully qualified way: {project_owner_username}/{project_name}/{dataset_set}.

    So, for example, if I wanted to mount an output directory so that I could create a new Snapshot to a Dataset called iris in a Project called datascience owned by john_smith, I should have a YAML entry that looks something like:

    datasetConfigurations:
      - name: "new iris snapshot"
        outputs:
        - path: "new_iris"
          dataset: "john_smith/datascience/iris"
    

    Here, I called the configuration new iris snapshot and I chose a mount point of new_iris; both of these are up to you.

    Once you’ve properly mounted the output directory for your new Snapshot, you can simply launch a Run and copy or move the contents of your Scratch Space to the output directory.

  8. What Dataset Scratch Spaces are shown in the administration table?

    All Dataset Scratch Spaces are shown in the administration table for Dataset Scratch Spaces. A Dataset Scratch Space becomes “active” any time a User starts a Workspace in a Project. Hence, there is a Dataset Scratch Space for every Project in which a User has started a Workspace.

  9. Why am I allowed to delete the contents of a Datasets Scratch Space that uses zero bytes?

    Scratch Spaces with any files are eligible to be deleted; this is the case even if the files are zero bytes.
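For the manual copy described in FAQ item 7, a minimal sketch using the Python standard library might look like the following. The function name copy_scratch_contents is hypothetical, and both directories are passed in as arguments because the exact location where the output directory is mounted is not stated here; check the mount point in your own session before running anything like this.

```python
import shutil
from pathlib import Path


def copy_scratch_contents(scratch_dir: str, output_dir: str) -> int:
    """Copy everything under scratch_dir into output_dir.

    Returns the number of files found in scratch_dir. Existing files with
    the same relative path in output_dir are overwritten. The Scratch Space
    itself is left untouched (this is a copy, not a move).
    """
    src = Path(scratch_dir)
    dst = Path(output_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"not a directory: {src}")
    # dirs_exist_ok lets us copy into an already-mounted output directory
    # (requires Python 3.8 or newer).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for path in src.rglob("*") if path.is_file())
```

A copy is used rather than a move so that nothing is lost if the operation is interrupted; you can clear the Scratch Space afterwards once you have confirmed the new Snapshot.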