Use these dashboards to monitor the health, reliability, and performance of your model APIs. Identify issues quickly and optimize model serving to ensure production-ready model endpoints that meet your SLA requirements.
As an IT administrator, you can treat model endpoints as first-class production services with comprehensive monitoring, alerting, and troubleshooting capabilities.
Important
The Cloud Administrator role is required to access Grafana workload monitoring dashboards. Regular users can view basic metrics through the Domino endpoint interface, but advanced monitoring and alerting features require administrator privileges.
As a Cloud Administrator, you can access model endpoint monitoring through two methods:
Method 2: From model endpoint details
1. Navigate to Endpoints in your Domino Workspace.
2. Select the endpoint you want to monitor.
3. Click Versions and select the specific version.
4. On the version details page, click the Grafana link (available to Domino administrators).
5. The Grafana dashboard opens with the model ID and version pre-selected.
Both methods provide access to the same comprehensive monitoring dashboards with detailed metrics and alerting capabilities.
Track status codes and success rates to identify errors, uptime issues, and model failures. Monitor authentication patterns and detect potential security issues or misconfigurations.
Panels: Model Performance Summary, HTTP Status Codes, Success Rate, Authentication Status
- Key metrics
  - Status code distribution (2xx, 4xx, 5xx)
  - Success rate (excluding 401s)
  - Request volumes and patterns
  - Authentication success/failure rates
  - API key usage patterns
- What to watch
  - Success rate <99%: Reliability concerns requiring immediate investigation
  - High 4xx rates: Client-side misconfiguration or authentication issues
  - High 5xx rates: Model or infrastructure failures
  - Authentication failures: Potential security threats or credential issues
- Targets and thresholds
  - Success rate: >99% (excellent), 95–99% (acceptable), <95% (investigate immediately)
  - 4xx errors: <1% of requests
  - 5xx errors: <0.1% of requests
  - Authentication success rate: >98%
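To put the success-rate target on a Grafana panel, a query along these lines works. This is a sketch that assumes the same http_requests_total metric and three-digit status label used in the alert examples below.
# Success rate excluding 401 responses (2xx requests / all non-401 requests)
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total{status!="401"}[5m]))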
Setting up alerts for health metrics
Configure alerts to proactively detect issues:
# High error rate alert
sum(rate(http_requests_total{status=~"5.*|4.*"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
# Authentication failure alert
sum(rate(http_requests_total{status="401"}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
Measure model responsiveness and detect performance bottlenecks. Use latency percentiles to understand the user experience and identify models that need optimization.
Panels: Latency Percentiles, Request Times, Response Time Distribution
- Key metrics
  - P50, P90, P95, P99 latency percentiles
  - Upstream vs. total response time differentiation
  - Response time trends over time
- Understanding the metrics
  - P50 (median): Represents typical user experience
  - P95/P99: Captures worst-case scenarios and outliers
  - Upstream time: Time spent in model inference
  - Total time: Includes network overhead and request processing
- Focus areas
  - High P50: Core model slowness affecting all users
  - High P95/P99: Sporadic performance issues affecting some users
  - Increasing trends: Performance degradation requiring intervention
  - Large upstream vs. total gaps: Network or infrastructure bottlenecks (see the sketch after the alert examples below)
Response time alerting
Configure alerts for response time degradation:
# Slow responses alert (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)) > 2
# Median response time alert
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)) > 1
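To surface the upstream vs. total gap called out above, you can subtract the two percentiles. This is a sketch only: http_upstream_request_duration_seconds_bucket is a placeholder for whatever upstream-duration histogram your deployment exposes, while the total-duration histogram matches the alerts above.
# Gap between total and upstream P95 latency; a persistent gap suggests network or request-processing overhead
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)) - histogram_quantile(0.95, sum(rate(http_upstream_request_duration_seconds_bucket[5m])) by (le, model_id, version))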
Monitor request volume and distribution to identify spikes, uneven traffic patterns, and scaling needs. Understand usage patterns to proactively prepare for peak loads and optimize resource allocation.
Panels: Request Volume, Model Paths, Request Rate, Traffic Distribution
- Key indicators
  - Requests per second (RPS) and total request volumes
  - Path-level distribution and endpoint usage patterns
  - Load balancing efficiency across model instances
  - Peak usage times and traffic trends
- Actionable insights
  - Prepare for peak loads: Identify usage patterns to pre-scale resources
  - Address skewed distribution: Balance traffic across model instances
  - Track growth trends: Plan capacity based on historical usage
  - Detect anomalies: Identify unusual traffic spikes or drops
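For anomaly detection, a sketch like the following flags sudden traffic drops. It assumes the same http_requests_total metric used in the health alerts; the 0.5 factor is an illustrative threshold, not a recommendation.
# Traffic drop alert: current request rate is less than half the rate one hour ago
sum(rate(http_requests_total[5m])) by (model_id, version) < 0.5 * sum(rate(http_requests_total[5m] offset 1h)) by (model_id, version)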
Check pod health, CPU, memory, and network usage to detect resource constraints and failures. Monitor resource efficiency to optimize costs and ensure reliable model serving.
Panels: Pods, Container Restarts, CPU Utilization, Memory Usage, Network I/O
- Critical metrics
  - CPU usage (average and peak utilization)
  - Memory consumption and allocation patterns
  - Pod restart frequency and reasons
  - Network throughput and bandwidth usage
- Target thresholds
  - CPU usage: <70% average, <90% peak for sustained periods
  - Memory usage: <80% of allocated limits
  - Container restarts: <1 restart per day per pod
  - Network I/O: Monitor for saturation during peak traffic (see the throughput sketch after the alert examples below)
- Optimization opportunities
  - Scale resources based on actual usage patterns
  - Address frequent restarts indicating stability issues
  - Monitor network throughput for bottlenecks
  - Right-size resource allocations to reduce costs
Resource utilization alerts
Monitor resource constraints:
# High CPU usage alert
avg(rate(process_cpu_seconds_total{job="model-endpoints"}[5m])) by (model_id, version) > 0.8
# High memory usage alert
avg(process_resident_memory_bytes{job="model-endpoints"}) by (model_id, version) / avg(container_spec_memory_limit_bytes) by (model_id, version) > 0.8
# Container restart alert
increase(kube_pod_container_status_restarts_total[1h]) > 0
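For the network I/O check in the target thresholds above, a per-pod throughput query can be graphed alongside your node or interface limits. This is a sketch assuming standard cAdvisor container metrics are scraped in your cluster.
# Total network throughput (receive + transmit) per pod, in bytes per second
sum(rate(container_network_receive_bytes_total[5m])) by (pod) + sum(rate(container_network_transmit_bytes_total[5m])) by (pod)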
Compare performance across models and versions to identify underperformers and optimize resource allocation. Use these insights to make data-driven decisions about model deployment and resource management.
Panels: Response Times Table, Response Size, Request Breakdown, Model Comparison
- Comparative metrics
  - Latency per model and version
  - Payload sizes and data transfer patterns
  - Request frequency and error rates by model
  - Resource efficiency ratios (requests per CPU/memory unit)
- Insights for optimization
  - Identify slow models: Spot models consistently performing below thresholds
  - Track performance regressions: Compare versions to detect degradation
  - Allocate resources per model: Right-size resources based on actual usage
  - Cost optimization: Identify high-resource, low-value models
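For a side-by-side comparison, a query like the following ranks models and versions by P95 latency and can back a table panel (a sketch reusing the request-duration histogram from the response time alerts above):
# P95 latency per model and version, sorted worst-first for comparison
sort_desc(histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)))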
Use these dashboards to diagnose and resolve common model endpoint problems.
- Use execution monitoring dashboards to track workload performance, identify issues early, and optimize execution across your deployment.
- Use data plane agent monitoring dashboards to track the health and performance of Domino Data Plane Agents, including execution management and Kubernetes resource activity.
- See Monitor model endpoint performance (User Guide) to help data scientists understand how to use monitoring during model development and optimization.