Keep an eye on your deployed models with model API health checks and logs. Domino can make sure that your endpoints are running and can alert you when they are down. Use logs to troubleshoot and audit your model APIs.
Domino monitors every model API’s health and ability to respond to new inference requests. When you update the health check settings, the model API automatically restarts.
- Navigate to Endpoints.
- Select a model API, then adjust the fields in Settings > Advanced:
- Initial delay: The time (in seconds) that Domino waits before a new model API can receive incoming requests. Change the value of this setting to delay the initialization of a model API.
- Health check period: How often (in seconds) Domino checks the model API health. Health check period × Failure threshold must be greater than the Override request timeout from the timeout settings.
- Timeout settings: The time (in seconds) that Domino allows an inference request to take before timing it out. When a request times out, Domino responds with 504 Gateway Timeout. The default is 60 seconds. You must restart the model API for timeout setting changes to take effect.
- Health check timeout: The length of time (in seconds) that Domino waits before it considers a health check request as failed.
- Failure threshold: The number of consecutive health check requests that must fail before Domino considers the model API instance unrecoverable and restarts it.
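The constraint between Health check period, Failure threshold, and the request timeout can be checked by hand before you save the settings. The following is a minimal sketch in Python, not part of Domino: the class name `HealthCheckSettings` and its field and method names are hypothetical and only mirror the fields described above. The 60-second default for the request timeout is the one stated in the timeout settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheckSettings:
    """Hypothetical holder for values entered in Settings > Advanced.

    Times are in seconds; failure_threshold is a count of consecutive failures.
    """
    health_check_period: int
    failure_threshold: int
    request_timeout: int = 60  # documented default for the timeout setting

    def failure_window(self) -> int:
        """Return Health check period x Failure threshold."""
        return self.health_check_period * self.failure_threshold

    def is_valid(self) -> bool:
        """True when the product is strictly greater than the request timeout."""
        return self.failure_window() > self.request_timeout


# A 10-second period with a threshold of 3 gives 30, which is not greater than 60.
too_short = HealthCheckSettings(health_check_period=10, failure_threshold=3)
print(too_short.failure_window(), too_short.is_valid())  # 30 False

# Raising the period to 30 seconds gives 90, which satisfies the constraint.
ok = HealthCheckSettings(health_check_period=30, failure_threshold=3)
print(ok.failure_window(), ok.is_valid())  # 90 True
```

Note that the comparison is strict: a product exactly equal to the request timeout does not satisfy "greater than".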
Domino offers multiple logs for troubleshooting and auditing your model APIs.
Check the Logs column for a specific model API version to view build, export, instance, or deployment logs.
- Build Logs: Events that occurred while building the image. See the build definition and the metadata needed to complete the build.
- Export Logs: Export details for the model API.
- Instance Logs: Logs related to individual containers for a given model API instance. View all model APIs and all containers, or filter the information by model API and container.
- Deployment Logs: Chronological events related to the deployment. These events include heartbeats, jobs, deployments, and Kubernetes events. Inspect payloads that contain pod and status information. Container status information identifies where images are in the deployment and indicates their state.