Use model endpoint monitoring dashboards

Use these dashboards to monitor the health, reliability, and performance of your model APIs. Identify issues quickly and optimize model serving to ensure production-ready model endpoints that meet your SLA requirements.

As an IT administrator, you can treat model endpoints as first-class production services with comprehensive monitoring, alerting, and troubleshooting capabilities.

Important
The Cloud Administrator role is required to access Grafana workload monitoring dashboards. Regular users can view basic metrics through the Domino endpoint interface, but advanced monitoring and alerting features require administrator privileges.

Access the model endpoint monitoring dashboards

As a Cloud Administrator, you can access model endpoint monitoring through two methods:

Method 1: Direct Grafana access

  1. Navigate to the Admin section in Domino.

  2. Go to Reports > Grafana.

  3. Select Dashboards from the Grafana interface.

  4. Choose the Model Endpoints dashboard.

  5. Use the filters to select your specific model ID and version.

Method 2: From model endpoint details

  1. Navigate to Endpoints in your Domino Workspace.

  2. Select the endpoint you want to monitor.

  3. Click on Versions and select the specific version.

  4. On the version details page, click the Grafana link (available to Domino administrators). The Grafana dashboard opens with the model ID and version pre-selected.

Both methods provide access to the same comprehensive monitoring dashboards with detailed metrics and alerting capabilities.

Model health and reliability

Track status codes and success rates to identify errors, uptime issues, and model failures. Monitor authentication patterns and detect potential security issues or misconfigurations.

Panels: Model Performance Summary, HTTP Status Codes, Success Rate, Authentication Status

  • Key metrics

    • Status code distribution (2xx, 4xx, 5xx)

    • Success rate (excluding 401s)

    • Request volumes and patterns

    • Authentication success/failure rates

    • API key usage patterns

  • What to watch

    • Success rate <99%: Reliability concern; below 95%, investigate immediately

    • High 4xx rates: Client-side misconfiguration or authentication issues

    • High 5xx rates: Model or infrastructure failures

    • Authentication failures: Potential security threats or credential issues

  • Targets and thresholds

    • Success rate: >99% (excellent), 95–99% (acceptable), <95% (investigate immediately)

    • 4xx errors: <1% of requests

    • 5xx errors: <0.1% of requests

    • Authentication success rate: >98%
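The "success rate (excluding 401s)" computation can be reproduced as a Grafana query; a sketch, assuming the same `http_requests_total` metric and `status` label used in the alert examples, and treating 2xx responses as successes:

# Success rate over 5 minutes: 2xx responses divided by all non-401 requests
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total{status!="401"}[5m]))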

Setting up alerts for health metrics

Configure alerts to proactively detect issues:

# High error rate alert: combined 4xx/5xx above 5% of requests over 5 minutes
sum(rate(http_requests_total{status=~"5.*|4.*"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

# Authentication failure alert: 401s above 2% of requests over 5 minutes
sum(rate(http_requests_total{status="401"}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
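In clusters where alerts are managed as Prometheus rule files rather than in the Grafana UI, an expression like the one above can be wrapped in an alerting rule; a sketch in which the group name, `for` duration, and severity label are illustrative:

groups:
  - name: model-endpoint-health
    rules:
      - alert: ModelEndpointHighErrorRate
        # Combined 4xx/5xx error rate above 5%, sustained for 10 minutes
        expr: sum(rate(http_requests_total{status=~"5.*|4.*"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Model endpoint 4xx/5xx error rate above 5% for 10 minutes

The `for` clause keeps transient blips from paging anyone; tune it to your SLA.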

Response time and latency

Measure model responsiveness and detect performance bottlenecks. Use latency percentiles to understand the user experience and identify models that need optimization.

Panels: Latency Percentiles, Request Times, Response Time Distribution

  • Key metrics

    • P50, P90, P95, P99 latency percentiles

    • Upstream vs total response time differentiation

    • Response time trends over time

  • Understanding the metrics

    • P50 (median): Represents typical user experience

    • P95/P99: Captures worst-case scenarios and outliers

    • Upstream time: Time spent in model inference

    • Total time: Includes network overhead and request processing

  • Focus areas

    • High P50: Core model slowness affecting all users

    • High P95/P99: Sporadic performance issues affecting some users

    • Increasing trends: Performance degradation requiring intervention

    • Large upstream vs total gaps: Network or infrastructure bottlenecks
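The upstream-versus-total gap can be charted directly; a sketch that assumes a hypothetical `http_upstream_duration_seconds_bucket` histogram for inference-only time (the actual upstream metric name depends on your ingress configuration):

# P95 total latency minus P95 upstream (inference) latency; a persistently
# large difference points to network or request-processing overhead
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version))
  - histogram_quantile(0.95, sum(rate(http_upstream_duration_seconds_bucket[5m])) by (le, model_id, version))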

Response time alerting

Configure alerts for response time degradation:

# Slow responses alert: P95 latency above 2 seconds
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)) > 2

# Median response time alert: P50 latency above 1 second
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)) > 1

Traffic patterns and load

Monitor request volume and distribution to identify spikes, uneven traffic patterns, and scaling needs. Understand usage patterns to proactively prepare for peak loads and optimize resource allocation.

Panels: Request Volume, Model Paths, Request Rate, Traffic Distribution

  • Key indicators

    • Requests per second (RPS) and total request volumes

    • Path-level distribution and endpoint usage patterns

    • Load balancing efficiency across model instances

    • Peak usage times and traffic trends

  • Actionable insights

    • Prepare for peak loads: Identify usage patterns to pre-scale resources

    • Address skewed distribution: Balance traffic across model instances

    • Track growth trends: Plan capacity based on historical usage

    • Detect anomalies: Identify unusual traffic spikes or drops
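Queries along these lines surface the indicators above; the `path` label is an assumption and may be named differently in your metric schema:

# Requests per second, per model
sum(rate(http_requests_total[5m])) by (model_id)

# Each path's share of total traffic, to spot skewed distribution
sum(rate(http_requests_total[5m])) by (path) / scalar(sum(rate(http_requests_total[5m])))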

Traffic-based alerts

Monitor for unusual traffic patterns:

# High request rate alert: sustained load above 100 RPS per model
sum(rate(http_requests_total[5m])) by (model_id) > 100

# Traffic spike detection: more than 1,000 requests added in 10 minutes
increase(http_requests_total[10m]) > 1000

Resource utilization and performance

Check pod health, CPU, memory, and network usage to detect resource constraints and failures. Monitor resource efficiency to optimize costs and ensure reliable model serving.

Panels: Pods, Container Restarts, CPU Utilization, Memory Usage, Network I/O

  • Critical metrics

    • CPU usage (average and peak utilization)

    • Memory consumption and allocation patterns

    • Pod restart frequency and reasons

    • Network throughput and bandwidth usage

  • Target thresholds

    • CPU usage: <70% average, <90% peak for sustained periods

    • Memory usage: <80% of allocated limits

    • Container restarts: <1 restart per day per pod

    • Network I/O: Monitor for saturation during peak traffic

  • Optimization opportunities

    • Scale resources based on actual usage patterns

    • Address frequent restarts indicating stability issues

    • Monitor network throughput for bottlenecks

    • Right-size resource allocations to reduce costs
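For right-sizing, per-pod usage can be compared against limits; a sketch using standard cAdvisor metric names, with the `pod` regex as a placeholder for however your model pods are named:

# CPU cores consumed per pod, averaged over 5 minutes
avg(rate(container_cpu_usage_seconds_total{pod=~"model-.*"}[5m])) by (pod)

# Memory working set as a fraction of the container limit
container_memory_working_set_bytes{pod=~"model-.*"} / container_spec_memory_limit_bytes{pod=~"model-.*"}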

Resource utilization alerts

Monitor resource constraints:

# High CPU usage alert: average above 80% of a CPU core
avg(rate(process_cpu_seconds_total{job="model-endpoints"}[5m])) by (model_id, version) > 0.8

# High memory usage alert: above 80% of the container memory limit
avg(process_resident_memory_bytes{job="model-endpoints"}) by (model_id, version) / avg(container_spec_memory_limit_bytes) by (model_id, version) > 0.8

# Container restart alert: any restart in the past hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

Model-specific performance

Compare performance across models and versions to identify underperformers and optimize resource allocation. Use these insights to make data-driven decisions about model deployment and resource management.

Panels: Response Times Table, Response Size, Request Breakdown, Model Comparison

  • Comparative metrics

    • Latency per model and version

    • Payload sizes and data transfer patterns

    • Request frequency and error rates by model

    • Resource efficiency ratios (requests per CPU/memory unit)

  • Insights for optimization

    • Identify slow models: Spot models consistently performing below thresholds

    • Track performance regressions: Compare versions to detect degradation

    • Allocate resources per model: Right-size resources based on actual usage

    • Cost optimization: Identify high-resource, low-value models
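Comparisons like these can be expressed as ranking queries; a sketch reusing the metric and label names from the alert examples:

# Five slowest model versions by P95 latency
topk(5, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)))

# Rough efficiency ratio: requests served per CPU-second, by model version
sum(rate(http_requests_total[5m])) by (model_id, version) / sum(rate(process_cpu_seconds_total{job="model-endpoints"}[5m])) by (model_id, version)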

Troubleshooting common issues

Use these dashboards to diagnose and resolve common model endpoint problems:

High error rates

  • Check the HTTP Status Codes panel for error distribution

  • Correlate with recent deployments or traffic changes

  • Review authentication metrics for 401 errors

  • Examine resource usage for 5xx errors caused by resource exhaustion
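To localize which endpoint is failing, errors can be broken down by model and status code; a sketch using the same metric as the health alerts:

# 5xx rate per model and status code
sum(rate(http_requests_total{status=~"5.."}[5m])) by (model_id, status)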

Performance degradation

  • Compare current latency percentiles with historical baselines

  • Check CPU and memory usage for resource constraints

  • Review request patterns for traffic spikes

  • Analyze individual model performance for regressions
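A quick way to compare against a historical baseline is PromQL's `offset` modifier; a sketch comparing the current P95 with the same window one week earlier:

# Ratio of current P95 latency to P95 one week ago; values well above 1 suggest regression
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id)) / histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m] offset 1w)) by (le, model_id))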

Resource issues

  • Monitor container restart patterns for stability problems

  • Check CPU and memory trending for capacity planning

  • Review network I/O for bandwidth limitations

  • Examine pod health for infrastructure issues

Best practices for IT administrators

Daily monitoring checklist

  • Review HTTP status code distributions across all models

  • Check resource usage patterns and identify anomalies

  • Verify authentication success rates and security metrics

  • Monitor alert status and resolve any active incidents

Weekly performance review

  • Analyze performance trends and identify degradation patterns

  • Review models with resource inefficiencies

  • Check alert history and tune thresholds if needed

  • Collaborate with data science teams on optimization opportunities

Monthly maintenance tasks

  • Update alert thresholds based on historical performance data

  • Fine-tune dashboards for emerging usage patterns

  • Review access logs and security metrics

  • Plan capacity based on growth trends

Alert strategy recommendations

  • Set up tiered alerting (warning, critical, emergency)

  • Configure different notification channels for different severity levels

  • Implement alert correlation to reduce noise

  • Document runbooks for common alert scenarios
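Tiered alerting can be sketched as two rules on the same expression with different thresholds and hold durations; the rule names, thresholds, and severity labels below are illustrative:

groups:
  - name: model-endpoint-latency
    rules:
      - alert: ModelLatencyWarning
        # P95 above 1 second for 15 minutes: degraded but tolerable
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id)) > 1
        for: 15m
        labels:
          severity: warning
      - alert: ModelLatencyCritical
        # P95 above 2 seconds for 5 minutes: SLA at risk
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id)) > 2
        for: 5m
        labels:
          severity: critical

Routing `warning` to a chat channel and `critical` to a pager keeps noise down while still catching hard failures.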

Next steps