Domino has configurable values to help you tune your cluster to balance performance with cost controls. The more idle volumes you allow, the more likely it is that users can reuse a volume and avoid copying project files from the blob store. However, this comes at the cost of keeping additional idle PVs.
By default, Domino will:
- Limit the total number of idle PVs to 32. This can be adjusted by setting the following option in the central config:
  common com.cerebro.domino.computegrid.kubernetes.volume.maxIdle
- Terminate any idle PV that has not been used in a certain number of days. This can be adjusted by setting the following option in the central config:
  common com.cerebro.domino.computegrid.kubernetes.volume.maxAge
  This value is expressed in terms of days. The default value is empty, which means unlimited. A value of 7d will terminate any idle PV after seven days.
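As a rough illustration of how a count limit and an age limit interact, the following Python sketch parses a value such as 7d and picks which idle PVs would be terminated. It is not Domino's implementation: the function names, the input shape (name and last-used time), and the choice to drop the least recently used PVs first when over the count limit are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_days(value):
    """Parse a value such as '7d' into a timedelta; an empty value means unlimited (None)."""
    if not value:
        return None
    match = re.fullmatch(r"(\d+)d", value.strip())
    if match is None:
        raise ValueError(f"expected a value like '7d', got {value!r}")
    return timedelta(days=int(match.group(1)))


def select_for_termination(idle_pvs, max_idle=32, max_age="", now=None):
    """Return the names of idle PVs to terminate.

    idle_pvs is a list of (name, last_used) tuples (hypothetical shape).
    """
    now = now or datetime.now(timezone.utc)
    age_limit = parse_days(max_age)

    # Age rule: anything unused for longer than max_age is terminated.
    expired = [
        name for name, last_used in idle_pvs
        if age_limit is not None and now - last_used > age_limit
    ]

    # Count rule: keep at most max_idle of the rest.
    # Assumption: the least recently used PVs are dropped first.
    remaining = sorted(
        (pv for pv in idle_pvs if pv[0] not in expired),
        key=lambda pv: pv[1],
        reverse=True,
    )
    over_limit = [name for name, _ in remaining[max_idle:]]
    return expired + over_limit
```

With the defaults (32 idle PVs, empty maxAge), nothing is terminated until the 33rd idle PV appears. Lowering either value trades reuse for lower storage cost.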
If a user’s job fails unexpectedly, Domino preserves the volume so that data can be recovered.
After a workspace or job ends, claimed PVs are placed into one of the following states, indicated with the dominodatalab.com/volume-state label.
- available
  If the run ends normally, the underlying PV will be available for future runs.
- salvaged
  If the run fails, the underlying PV will not be eligible for reuse, and is held in this state to be salvaged.
Salvaged PVs will not be reused automatically by future workspaces or jobs, but can be manually mounted to a workspace to recover work.
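Because the state is stored as a label, a standard kubectl label selector can list the PVs in a given state. The sketch below only builds the command line. The helper name is hypothetical, and it assumes the dominodatalab.com/volume-state label is set directly on the PersistentVolume objects, as described above.

```python
import shlex

STATE_LABEL = "dominodatalab.com/volume-state"
KNOWN_STATES = ("available", "salvaged")


def list_pvs_command(state):
    """Build a kubectl command that lists PVs carrying the given state label."""
    if state not in KNOWN_STATES:
        raise ValueError(f"unknown volume state: {state!r}")
    # -l applies a label selector of the form key=value.
    return ["kubectl", "get", "pv", "-l", f"{STATE_LABEL}={state}"]


print(shlex.join(list_pvs_command("salvaged")))
# prints: kubectl get pv -l dominodatalab.com/volume-state=salvaged
```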
By default, Domino will:
- Limit the total number of salvaged PVs to 64. This can be adjusted by setting the following option in the central config:
  common com.cerebro.domino.computegrid.kubernetes.volume.maxSalvaged
- Terminate any salvaged PV that has not been used in a certain number of days. This can be adjusted by setting the following option in the central config:
  common com.cerebro.domino.computegrid.kubernetes.volume.maxSalvagedAge
  The value is expressed in terms of days. The default value is seven days. A value of 14d will terminate any salvaged PV after fourteen days.
To recover files from a salvaged PV:
1. Find the PV that was attached to your job or workspace; it is listed in the Deployment logs for that job or workspace.
2. Create a pod attached to the salvaged volume.
3. Recover the files with whichever method is most convenient (scp, AWS CLI, kubectl cp, and so on).
This script performs Step 2 and prints the appropriate commands in its output. Remember to delete the PVC and PV when you are done; otherwise, these resources will remain allocated.
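For readers who want to see what Step 2 involves, the following Python sketch generates a PersistentVolumeClaim pinned to a specific PV and a pod that mounts it. It is not the script referred to above, only a minimal illustration. The function name, the recover- naming scheme, the busybox image, the /salvaged mount path, and the example namespace are all assumptions. It also assumes that the PV can be bound by a new claim and that the storage class, access mode, and size match those of the PV.

```python
import json


def recovery_manifests(pv_name, namespace, storage, storage_class,
                       access_mode="ReadWriteOnce"):
    """Build a PVC bound to pv_name and a pod that mounts it at /salvaged."""
    claim_name = f"recover-{pv_name}"  # hypothetical naming scheme
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": claim_name, "namespace": namespace},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "volumeName": pv_name,  # pin the claim to the salvaged PV
            "resources": {"requests": {"storage": storage}},
        },
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": claim_name, "namespace": namespace},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "recover",
                "image": "busybox",  # assumption: any image with a shell and tar
                "command": ["sleep", "3600"],  # keep the pod alive for copying
                "volumeMounts": [{"name": "salvaged", "mountPath": "/salvaged"}],
            }],
            "volumes": [{
                "name": "salvaged",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }
    # kubectl accepts JSON as well as YAML, and a List wraps both objects.
    return {"apiVersion": "v1", "kind": "List", "items": [pvc, pod]}


# Example values are placeholders, not taken from a real deployment.
print(json.dumps(
    recovery_manifests("pv-example", "example-namespace", "10Gi", "example-class"),
    indent=2,
))
```

The printed JSON could be piped to kubectl apply -f - and the files then copied out with kubectl cp. Afterward, delete the pod, the PVC, and the PV, as noted above.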