Domino model APIs are scalable, high-availability REST services. Use the Deployment tab of the model settings page to configure the following items for your model:
- The number of model hosts serving your model.
- The compute resources available to your model hosts.
- The number of routes (or versions) you want to expose.
You can scale each version of your model in the following ways:
- Horizontally: Select the number of model instances that you want running at any given time. Domino automatically load-balances requests to the model endpoint across these instances, as illustrated in the sketch below. A minimum of two instances (the default) provides a high-availability setup. Domino supports up to 32 instances per model.
- Vertically: Select a hardware tier that determines the amount of RAM and CPU resources available to each model instance.
If you change these scaling settings, you must restart the model for the changes to take effect.
For R model APIs, set the Compute Resources per instance to at least 1 GB. The R model API container alone requires at least 720 MB of RAM to run, and the other containers created alongside it, such as Istio and Fluent Bit, require additional memory.
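The following minimal sketch shows that horizontal scaling is transparent to clients: a burst of concurrent requests all target the same endpoint URL, and Domino distributes them across the running instances. The host, model ID, access token, authentication scheme, and request payload shape are placeholders for illustration and may differ in your deployment.

```python
# Minimal sketch: send a burst of concurrent scoring requests to a scaled
# model API. The host, model ID, token, and payload shape are placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests

DOMINO_HOST = "https://domino.example.com"   # hypothetical Domino host
MODEL_ID = "<modelId>"                       # your model's ID
ACCESS_TOKEN = "YOUR_MODEL_ACCESS_TOKEN"     # placeholder access token
URL = f"{DOMINO_HOST}/models/{MODEL_ID}/latest/model"


def score(payload):
    """POST one scoring request; Domino routes it to any healthy instance."""
    response = requests.post(
        URL,
        auth=(ACCESS_TOKEN, ACCESS_TOKEN),   # token-based basic auth (adjust to your setup)
        json={"data": payload},              # assumed request body shape
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# With two or more instances running, Domino spreads these 16 concurrent
# calls across instances; the client code does not change as you scale.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(score, [{"x": i} for i in range(16)]))
```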
Domino supports the following routing modes:
- Basic mode: Only one route is exposed, and it always points to the latest successfully deployed model version. When you deploy a new version, the previous version is shut down and replaced while maintaining availability. The route has the following signature (a client sketch follows this list):
  Latest: /models/<modelId>/latest/model
- Advanced mode: A promoted version and a latest version can run at the same time. This supports a workflow where your clients always point to the promoted version while you test against the latest version. When the latest version is ready for production, you can seamlessly switch it to be the promoted version with no downtime. The routes have the following signatures:
  Latest: /models/<modelId>/latest/model
  Promoted: /models/<modelId>/labels/prod/model
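As a rough illustration of how the two routes are consumed in advanced mode, the sketch below sends production traffic to the promoted route and test traffic to the latest route. The host, model ID, access token, authentication scheme, and payload shape are assumptions for illustration; only the route paths come from this documentation.

```python
# Minimal sketch of calling the two routes exposed in advanced routing mode.
# Host, model ID, token, auth scheme, and payload shape are placeholders.
import requests

DOMINO_HOST = "https://domino.example.com"   # hypothetical Domino host
MODEL_ID = "<modelId>"                       # your model's ID
ACCESS_TOKEN = "YOUR_MODEL_ACCESS_TOKEN"     # placeholder access token

ROUTES = {
    "promoted": f"{DOMINO_HOST}/models/{MODEL_ID}/labels/prod/model",
    "latest": f"{DOMINO_HOST}/models/{MODEL_ID}/latest/model",
}


def score(payload, route="promoted"):
    """Call the promoted route for production traffic, or 'latest' to
    validate a newly deployed version before promoting it."""
    response = requests.post(
        ROUTES[route],
        auth=(ACCESS_TOKEN, ACCESS_TOKEN),   # token-based basic auth (adjust to your setup)
        json={"data": payload},              # assumed request body shape
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# Production clients keep calling the promoted route; promoting a new version
# does not change this URL, so no client changes are needed.
prediction = score({"x": 1.0})

# During validation, point test traffic at the latest version instead.
candidate = score({"x": 1.0}, route="latest")
```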