Domino makes it easy to save money while training machine learning models using Domino hardware tiers based on AWS Spot Instances. Amazon Spot Instances give you access to unused AWS EC2 compute instances at a discount over their on-demand prices.
It’s important to note that Spot Instances can be interrupted, causing Jobs to take longer to start or finish, or even fail mid-Job if the Spot Instance is reclaimed. Spot Instances are best suited to fault-tolerant, flexible workloads.
To mitigate the impact of losing progress on a model training Job, we recommend that you use model checkpointing. Domino automatically reloads the model checkpoints and mounts them to a local directory when the Job is restarted, enabling the training Job to resume from the last checkpoint instead of starting over from scratch.
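The resume-from-checkpoint pattern can be sketched as follows. This is a minimal, framework-agnostic illustration, not Domino's implementation: the file name and the dictionary-based "model state" are hypothetical stand-ins, and a real training script would use its framework's save/load functions instead of `pickle`.

```python
import os
import pickle

def load_checkpoint(path):
    """Return the saved training state, or None if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None

def train(total_epochs, path):
    # Resume from the last checkpoint if one was mounted into the container;
    # otherwise start from a fresh state.
    state = load_checkpoint(path) or {"epoch": 0, "weights": 0.0}
    for epoch in range(state["epoch"], total_epochs):
        state["weights"] += 0.1          # stand-in for one epoch of training
        state["epoch"] = epoch + 1
        with open(path, "wb") as f:      # checkpoint after every epoch
            pickle.dump(state, f)
    return state
```

If the Job is interrupted and restarted, calling `train` again picks up at the saved epoch rather than at zero.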
If AWS interrupts a Spot Instance, the Job may fail to start or complete. If this happens, we suggest retrying the Job after waiting at least 30 minutes. If the issue persists, the remediation is to switch the Job to a hardware tier that uses a non-Spot node pool.
Domino Jobs run in containers, so model checkpoint files are saved under a local directory in the container (we recommend that you save checkpoints at /mnt/artifacts/checkpoints/). Domino synchronizes artifacts, including model checkpoints, to external storage. Existing checkpoints in S3 are mounted to the container at the start of the Job, enabling Jobs to resume from a checkpoint. Checkpoints added to the S3 folder after the Job has started are not copied to the training container.
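Because only checkpoints present at Job start are mounted, it helps to name checkpoint files by epoch and to locate the most recent one when the Job begins. The sketch below assumes a simple `ckpt-<epoch>.pkl` naming scheme of our own invention; the directory is a parameter so the code runs anywhere, but on Domino it would typically be /mnt/artifacts/checkpoints/.

```python
import os
import re

def checkpoint_path(ckpt_dir, epoch):
    """Name checkpoints by epoch so earlier ones are never overwritten."""
    return os.path.join(ckpt_dir, f"ckpt-{epoch}.pkl")

def latest_checkpoint(ckpt_dir):
    """Return the path of the highest-numbered checkpoint file, or None."""
    pattern = re.compile(r"ckpt-(\d+)\.pkl$")
    best, best_epoch = None, -1
    for name in os.listdir(ckpt_dir):
        m = pattern.match(name)
        if m and int(m.group(1)) > best_epoch:
            best_epoch = int(m.group(1))
            best = os.path.join(ckpt_dir, name)
    return best
```

At Job start, `latest_checkpoint` finds the newest mounted checkpoint to resume from, regardless of how many earlier snapshots were synchronized.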
Use checkpoints to do the following:
- Save model snapshots during training in case of an unexpected interruption to the training Job or instance.
- Resume training the model in the future from a checkpoint.
- Analyze the model at intermediate stages of training.
- Use checkpoints with Spot Instances to save on training costs.
Checkpointing considerations
Consider the following when using checkpoints in Domino:
- To avoid overwrites in distributed training with multiple instances, you must manually configure the checkpoint file names and paths in your training script, accounting for the rank of each worker node in your cluster.
- To control the checkpointing frequency, modify your training script using the framework’s model save functions or checkpoint callbacks.
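The two considerations above can be sketched as small helpers. The naming scheme and the `every_n` parameter are hypothetical examples, not a Domino API; in a real distributed script the worker rank would come from your training framework (for instance, an environment variable it sets), and the save itself would use that framework's checkpoint functions or callbacks.

```python
import os

def worker_checkpoint_path(ckpt_dir, epoch, rank):
    """Include the worker rank in the filename so workers in a distributed
    cluster never overwrite each other's checkpoints."""
    return os.path.join(ckpt_dir, f"ckpt-rank{rank}-epoch{epoch}.pkl")

def should_checkpoint(epoch, every_n=5):
    """Checkpoint every N epochs to trade recovery granularity
    against checkpoint-writing overhead."""
    return (epoch + 1) % every_n == 0
```

Inside the training loop, each worker would call `should_checkpoint(epoch)` and, when it returns True, save to its own `worker_checkpoint_path`.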
Learn how to create a node pool with Spot Instances.