Domino model APIs are scalable, high-availability REST services. Use the Deployment tab of the model settings page to configure the following items for your model:
- The number of model hosts serving your model.
- The compute resources available to your model hosts.
- The number of routes (or versions) you want to expose.
API requests are handled sequentially due to limitations of the Python and R runtimes. If your code processes requests slowly, other requests in the queue can time out. When scaling your model, consider whether your code is memory-intensive, CPU-intensive, and so on.
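To see why slow request handling causes queued requests to time out, consider a single sequential worker: each queued request must wait for every request ahead of it. The sketch below uses illustrative numbers (the per-request processing time and client timeout are assumptions, not Domino defaults):

```python
# Sketch: with one sequential worker, the request at 0-based queue position
# `position` finishes only after every earlier request completes.
PROCESS_SECONDS = 2.0   # time your code takes per request (assumed)
CLIENT_TIMEOUT = 10.0   # client-side timeout (assumed)

def completion_time(position, process_seconds=PROCESS_SECONDS):
    """Seconds until the request at queue position `position` finishes."""
    return (position + 1) * process_seconds

# Position 4 (the 5th request) finishes at exactly 10 s and just meets the
# timeout; everything queued behind it times out.
for pos in range(8):
    t = completion_time(pos)
    status = "ok" if t <= CLIENT_TIMEOUT else "times out"
    print(f"request at position {pos}: done at {t:.0f}s -> {status}")
```

This is why a model whose code takes even a couple of seconds per request needs more workers or more instances long before it saturates CPU or memory.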
Scale all models (Python only)
1. As an administrator, go to Advanced > Central Config.
2. If the com.cerebro.domino.modelmanager.uWsgi.workerCount key is not listed on the Configuration Management page, click Add Record. At the end of the list, in the key column, enter com.cerebro.domino.modelmanager.uWsgi.workerCount. If the key is already listed, click the pencil icon to change its value.
3. In the value column, set the key to a value greater than its default value of 1 to increase the uWSGI worker count. See https://uwsgi-docs.readthedocs.io/en/latest/ for more information.
4. Click Create.
5. Under the Configuration Management page title, the system shows the following message: “Changes here do not take effect until services are restarted. Click here to restart services.” Click the here link to restart the services.
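The resulting configuration record would look something like the following. The value 4 is only an example; the default is 1, and the right value depends on how much memory and CPU each additional uWSGI worker consumes on your model hosts:

```
key:   com.cerebro.domino.modelmanager.uWsgi.workerCount
value: 4
```

Because every model host runs this many workers, increase the value gradually and verify that your hardware tier has headroom for the extra processes.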
Scale model versions
You can scale each version of your model in the following ways:
- Horizontally: Select the number of model instances that you want running at any given time. Domino automatically load-balances requests to the model endpoint between these instances. A minimum of two instances (the default) provides a high-availability setup. Domino supports up to 32 instances per model.
- Vertically: Select a hardware tier that determines the amount of RAM and CPU resources available to each model instance.
If you make changes to scale the model, you must restart it.
For R model APIs, set the Compute Resources per instance to at least 1 GB. The R model API container requires at least 720 MB of RAM to run, and the other containers created alongside it, such as Istio and fluentbit, require additional memory.
Domino supports the following routing modes:
- Basic mode
In this mode, you have only one route exposed, and it always points to the latest successfully deployed model version. When you deploy a new version, the old version is shut down and replaced while maintaining availability. The route has the following signature:
Latest: /models/<modelId>/latest/model
- Advanced mode
In this mode, you can have a promoted version and a latest version running at the same time. This supports a workflow in which your clients always point to the promoted version while you test with the latest. When the latest version is ready for production, you can seamlessly switch it to be the promoted version with no downtime. The routes have the following signatures:
Latest: /models/<modelId>/latest/model
Promoted: /models/<modelId>/labels/prod/model