Use model endpoint monitoring dashboards

Use these dashboards to monitor the health, reliability, and performance of your model APIs. Identify issues quickly and optimize model serving to ensure production-ready model endpoints that meet your SLA requirements.

As an IT administrator, you can treat model endpoints as first-class production services with comprehensive monitoring, alerting, and troubleshooting capabilities.

Important
The Cloud Administrator role is required to access Grafana workload monitoring dashboards. Regular users can view basic metrics through the Domino endpoint interface, but advanced monitoring and alerting features require administrator privileges.

Access the model endpoint monitoring dashboards

As a Cloud Administrator, you can access model endpoint monitoring through two methods:

Method 1: Direct Grafana access

  1. Navigate to the Admin section in Domino.

  2. Go to Reports > Grafana.

  3. Select Dashboards from the Grafana interface.

  4. Choose the Model Endpoints dashboard.

  5. Use the filters to select your specific model ID and version.

Method 2: From model endpoint details

  1. Navigate to Endpoints in your Domino Workspace.

  2. Select the endpoint you want to monitor.

  3. Click on Versions and select the specific version.

  4. On the version details page, click the Grafana link (available to Domino administrators). The Grafana dashboard opens with the model ID and version pre-selected.

Both methods provide access to the same comprehensive monitoring dashboards with detailed metrics and alerting capabilities.

Model health and reliability

Track status codes and success rates to identify errors, uptime issues, and model failures. Monitor authentication patterns and detect potential security issues or misconfigurations.

Panels: Model Performance Summary, HTTP Status Codes, Success Rate, Authentication Status

  • Key metrics

    • Status code distribution (2xx, 4xx, 5xx)

    • Success rate (excluding 401s)

    • Request volumes and patterns

    • Authentication success/failure rates

    • API key usage patterns

  • What to watch

    • Success rate <99%: Reliability concern; below 95%, investigate immediately

    • High 4xx rates: Client-side misconfiguration or authentication issues

    • High 5xx rates: Model or infrastructure failures

    • Authentication failures: Potential security threats or credential issues

  • Targets and thresholds

    • Success rate: >99% (excellent), 95–99% (acceptable), <95% (investigate immediately)

    • 4xx errors: <1% of requests

    • 5xx errors: <0.1% of requests

    • Authentication success rate: >98%
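The "success rate (excluding 401s)" computation can be reproduced as a Grafana query; a sketch, assuming the same `http_requests_total` metric and `status` label used in the alert examples, and treating 2xx responses as successes:

# Success rate over 5 minutes: 2xx responses divided by all non-401 requests
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total{status!="401"}[5m]))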

Setting up alerts for health metrics

Configure alerts to proactively detect issues:

# High error rate alert: combined 4xx/5xx above 5% of requests over 5 minutes
sum(rate(http_requests_total{status=~"5.*|4.*"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

# Authentication failure alert: 401s above 2% of requests over 5 minutes
sum(rate(http_requests_total{status="401"}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
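In clusters where alerts are managed as Prometheus rule files rather than in the Grafana UI, an expression like the one above can be wrapped in an alerting rule; a sketch in which the group name, `for` duration, and severity label are illustrative:

groups:
  - name: model-endpoint-health
    rules:
      - alert: ModelEndpointHighErrorRate
        # Combined 4xx/5xx error rate above 5%, sustained for 10 minutes
        expr: sum(rate(http_requests_total{status=~"5.*|4.*"}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Model endpoint 4xx/5xx error rate above 5% for 10 minutes

The `for` clause keeps transient blips from paging anyone; tune it to your SLA.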

Response time and latency

Measure model responsiveness and detect performance bottlenecks. Use latency percentiles to understand the user experience and identify models that need optimization.

Panels: Latency Percentiles, Request Times, Response Time Distribution

  • Key metrics

    • P50, P90, P95, P99 latency percentiles

    • Upstream vs total response time differentiation

    • Response time trends over time

  • Understanding the metrics

    • P50 (median): Represents typical user experience

    • P95/P99: Captures worst-case scenarios and outliers

    • Upstream time: Time spent in model inference

    • Total time: Includes network overhead and request processing

  • Focus areas

    • High P50: Core model slowness affecting all users

    • High P95/P99: Sporadic performance issues affecting some users

    • Increasing trends: Performance degradation requiring intervention

    • Large upstream vs total gaps: Network or infrastructure bottlenecks
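The upstream-versus-total gap can be charted directly; a sketch that assumes a hypothetical `http_upstream_duration_seconds_bucket` histogram for inference-only time (the actual upstream metric name depends on your ingress configuration):

# P95 total latency minus P95 upstream (inference) latency; a persistently
# large difference points to network or request-processing overhead
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version))
  - histogram_quantile(0.95, sum(rate(http_upstream_duration_seconds_bucket[5m])) by (le, model_id, version))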

Response time alerting

Configure alerts for response time degradation:

# Slow responses alert: P95 latency above 2 seconds
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)) > 2

# Median response time alert: P50 latency above 1 second
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)) > 1

Traffic patterns and load

Monitor request volume and distribution to identify spikes, uneven traffic patterns, and scaling needs. Understand usage patterns to proactively prepare for peak loads and optimize resource allocation.

Panels: Request Volume, Model Paths, Request Rate, Traffic Distribution

  • Key indicators

    • Requests per second (RPS) and total request volumes

    • Path-level distribution and endpoint usage patterns

    • Load balancing efficiency across model instances

    • Peak usage times and traffic trends

  • Actionable insights

    • Prepare for peak loads: Identify usage patterns to pre-scale resources

    • Address skewed distribution: Balance traffic across model instances

    • Track growth trends: Plan capacity based on historical usage

    • Detect anomalies: Identify unusual traffic spikes or drops
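Queries along these lines surface the indicators above; the `path` label is an assumption and may be named differently in your metric schema:

# Requests per second, per model
sum(rate(http_requests_total[5m])) by (model_id)

# Each path's share of total traffic, to spot skewed distribution
sum(rate(http_requests_total[5m])) by (path) / scalar(sum(rate(http_requests_total[5m])))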

Traffic-based alerts

Monitor for unusual traffic patterns:

# High request rate alert: sustained load above 100 RPS per model
sum(rate(http_requests_total[5m])) by (model_id) > 100

# Traffic spike detection: more than 1,000 requests added in 10 minutes
increase(http_requests_total[10m]) > 1000

Resource utilization and performance

Check pod health, CPU, memory, and network usage to detect resource constraints and failures. Monitor resource efficiency to optimize costs and ensure reliable model serving.

Panels: Pods, Container Restarts, CPU Utilization, Memory Usage, Network I/O

  • Critical metrics

    • CPU usage (average and peak utilization)

    • Memory consumption and allocation patterns

    • Pod restart frequency and reasons

    • Network throughput and bandwidth usage

  • Target thresholds

    • CPU usage: <70% average, <90% peak for sustained periods

    • Memory usage: <80% of allocated limits

    • Container restarts: <1 restart per day per pod

    • Network I/O: Monitor for saturation during peak traffic

  • Optimization opportunities

    • Scale resources based on actual usage patterns

    • Address frequent restarts indicating stability issues

    • Monitor network throughput for bottlenecks

    • Right-size resource allocations to reduce costs
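For right-sizing, per-pod usage can be compared against limits; a sketch using standard cAdvisor metric names, with the `pod` regex as a placeholder for however your model pods are named:

# CPU cores consumed per pod, averaged over 5 minutes
avg(rate(container_cpu_usage_seconds_total{pod=~"model-.*"}[5m])) by (pod)

# Memory working set as a fraction of the container limit
container_memory_working_set_bytes{pod=~"model-.*"} / container_spec_memory_limit_bytes{pod=~"model-.*"}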

Resource utilization alerts

Monitor resource constraints:

# High CPU usage alert: average above 80% of a CPU core
avg(rate(process_cpu_seconds_total{job="model-endpoints"}[5m])) by (model_id, version) > 0.8

# High memory usage alert: above 80% of the container memory limit
avg(process_resident_memory_bytes{job="model-endpoints"}) by (model_id, version) / avg(container_spec_memory_limit_bytes) by (model_id, version) > 0.8

# Container restart alert: any restart in the past hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

Model-specific performance

Compare performance across models and versions to identify underperformers and optimize resource allocation. Use these insights to make data-driven decisions about model deployment and resource management.

Panels: Response Times Table, Response Size, Request Breakdown, Model Comparison

  • Comparative metrics

    • Latency per model and version

    • Payload sizes and data transfer patterns

    • Request frequency and error rates by model

    • Resource efficiency ratios (requests per CPU/memory unit)

  • Insights for optimization

    • Identify slow models: Spot models consistently performing below thresholds

    • Track performance regressions: Compare versions to detect degradation

    • Allocate resources per model: Right-size resources based on actual usage

    • Cost optimization: Identify high-resource, low-value models
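Comparisons like these can be expressed as ranking queries; a sketch reusing the metric and label names from the alert examples:

# Five slowest model versions by P95 latency
topk(5, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id, version)))

# Rough efficiency ratio: requests served per CPU-second, by model version
sum(rate(http_requests_total[5m])) by (model_id, version) / sum(rate(process_cpu_seconds_total{job="model-endpoints"}[5m])) by (model_id, version)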

Troubleshooting common issues

Use these dashboards to diagnose and resolve common model endpoint problems:

High error rates

  • Check the HTTP Status Codes panel for error distribution

  • Correlate with recent deployments or traffic changes

  • Review authentication metrics for 401 errors

  • Examine resource usage for 5xx errors caused by resource exhaustion
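To localize which endpoint is failing, errors can be broken down by model and status code; a sketch using the same metric as the health alerts:

# 5xx rate per model and status code
sum(rate(http_requests_total{status=~"5.."}[5m])) by (model_id, status)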

Performance degradation

  • Compare current latency percentiles with historical baselines

  • Check CPU and memory usage for resource constraints

  • Review request patterns for traffic spikes

  • Analyze individual model performance for regressions
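A quick way to compare against a historical baseline is PromQL's `offset` modifier; a sketch comparing the current P95 with the same window one week earlier:

# Ratio of current P95 latency to P95 one week ago; values well above 1 suggest regression
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id)) / histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m] offset 1w)) by (le, model_id))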

Resource issues

  • Monitor container restart patterns for stability problems

  • Check CPU and memory trending for capacity planning

  • Review network I/O for bandwidth limitations

  • Examine pod health for infrastructure issues

Best practices for IT administrators

Daily monitoring checklist

  • Review HTTP status code distributions across all models

  • Check resource usage patterns and identify anomalies

  • Verify authentication success rates and security metrics

  • Monitor alert status and resolve any active incidents

Weekly performance review

  • Analyze performance trends and identify degradation patterns

  • Review models with resource inefficiencies

  • Check alert history and tune thresholds if needed

  • Collaborate with data science teams on optimization opportunities

Monthly maintenance tasks

  • Update alert thresholds based on historical performance data

  • Fine-tune dashboards for emerging usage patterns

  • Review access logs and security metrics

  • Plan capacity based on growth trends

Alert strategy recommendations

  • Set up tiered alerting (warning, critical, emergency)

  • Configure different notification channels for different severity levels

  • Implement alert correlation to reduce noise

  • Document runbooks for common alert scenarios
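Tiered alerting can be sketched as two rules on the same expression with different thresholds and hold durations; the rule names, thresholds, and severity labels below are illustrative:

groups:
  - name: model-endpoint-latency
    rules:
      - alert: ModelLatencyWarning
        # P95 above 1 second for 15 minutes: degraded but tolerable
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id)) > 1
        for: 15m
        labels:
          severity: warning
      - alert: ModelLatencyCritical
        # P95 above 2 seconds for 5 minutes: SLA at risk
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, model_id)) > 2
        for: 5m
        labels:
          severity: critical

Routing `warning` to a chat channel and `critical` to a pager keeps noise down while still catching hard failures.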

Next steps