Use model endpoint monitoring dashboards

Use these dashboards to monitor the health, reliability, and performance of your model APIs. Identify issues quickly and optimize model serving.

Model health and reliability

Track status codes and success rates to identify errors, uptime issues, and model failures.

Panels: Model Performance Summary, HTTP Status Codes, Success Rate

  • Key metrics

    • Status code distribution (2xx, 4xx, 5xx)

    • Success rate (excluding 401s, which indicate client authentication errors rather than service failures)

    • Request volumes and patterns

  • What to watch

    • Success rate <99%: Reliability concerns

    • High 4xx: Client-side misconfiguration

    • High 5xx: Model or infra failures

  • Targets

    • Success rate: >99% (excellent), 95–99% (acceptable), <95% (investigate)

    • 4xx: <1% of requests

    • 5xx: <0.1% of requests
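The success-rate calculation above can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual implementation: it buckets raw status codes into the 2xx/4xx/5xx distribution and computes the success rate with 401s excluded, as described above.

```python
# Illustrative sketch: status-code distribution and success rate excluding 401s.
from collections import Counter

def success_rate(status_codes):
    """Fraction of successful (2xx) responses, ignoring 401s."""
    counted = [c for c in status_codes if c != 401]
    if not counted:
        return 1.0
    successes = sum(1 for c in counted if 200 <= c < 300)
    return successes / len(counted)

def classify(status_codes):
    """Bucket codes into 2xx/4xx/5xx for a status-code distribution panel."""
    return Counter(f"{c // 100}xx" for c in status_codes)

codes = [200] * 990 + [401] * 5 + [404] * 3 + [500] * 2
print(f"success rate: {success_rate(codes):.2%}")  # ~99.50%, above the 99% target
print(classify(codes))
```

A rate below 99% here would map to the "investigate" band in the targets above.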

Response time and latency

Measure model responsiveness. Use latency percentiles to detect performance bottlenecks.

Panels: Latency Percentiles, Request Times

  • Metrics

    • P50, P90, P95, P99 latency

    • Upstream vs total response time

  • Focus areas

    • High P50: Core model slowness

    • High P95/P99: Sporadic performance issues

    • Consistently high latency across percentiles: Optimization needed
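To make the percentile metrics concrete, here is a small sketch computing P50/P90/P95/P99 from raw request durations with the nearest-rank method. This is illustrative only; dashboard backends typically compute quantiles from histograms and may interpolate differently.

```python
# Sketch: nearest-rank latency percentiles from a list of request durations.
def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100], samples in any order."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p/100 * n), clamped to >= 1
    return ordered[int(rank) - 1]

latencies_ms = [120, 135, 140, 150, 155, 160, 180, 210, 450, 1200]
for p in (50, 90, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

In this sample the P50 is modest while P95/P99 are dominated by the two outliers, which is the signature of sporadic performance issues rather than core model slowness.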

Traffic patterns and load

Monitor request volume and distribution. Identify spikes, uneven traffic, and scaling needs.

Panels: Request Volume, Model Paths, Request Rate

  • Indicators

    • Requests per second

    • Path-level distribution

    • Load balancing efficiency

  • Actions

    • Prepare for peak loads

    • Address skewed distribution

    • Track growth trends
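A skewed path-level distribution can be detected mechanically. The sketch below is hypothetical (the paths and the 80% threshold are illustrative, not values the dashboard uses): it derives the per-path traffic share from a request log and flags concentration that may indicate uneven load balancing.

```python
# Hypothetical sketch: per-path traffic share and a simple skew check.
from collections import Counter

def path_distribution(requests):
    """Fraction of total traffic per model path; `requests` is a list of path strings."""
    counts = Counter(requests)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}

def is_skewed(distribution, threshold=0.8):
    """True if any single path receives more than `threshold` of all traffic."""
    return max(distribution.values()) > threshold

requests = ["/v1/models/a"] * 90 + ["/v1/models/b"] * 10  # illustrative paths
dist = path_distribution(requests)
print(dist)             # model-a carries 90% of traffic
print(is_skewed(dist))  # True: one path dominates
```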

Resource utilization and performance

Check pod health, CPU, memory, and network usage. Detect resource constraints and failures.

Panels: Pods, Container Restarts, CPU/Memory/Network

  • Targets

    • CPU: <70% average, <90% peak

    • Memory: <80% of limit

    • Restarts: <1/day per pod

  • Optimization

    • Scale based on usage

    • Address frequent restarts

    • Monitor network throughput
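The utilization targets above translate directly into alert conditions. The following is a sketch under the assumption that you already have averaged samples available; the function and field names are illustrative, not part of any monitoring API.

```python
# Illustrative sketch: check pod resource samples against the targets above.
def resource_alerts(avg_cpu, peak_cpu, mem_fraction, restarts_per_day):
    """Return human-readable alerts for breaches of the documented targets."""
    alerts = []
    if avg_cpu >= 0.70:
        alerts.append(f"average CPU {avg_cpu:.0%} >= 70%: consider scaling out")
    if peak_cpu >= 0.90:
        alerts.append(f"peak CPU {peak_cpu:.0%} >= 90%: headroom exhausted")
    if mem_fraction >= 0.80:
        alerts.append(f"memory {mem_fraction:.0%} of limit >= 80%: OOM risk")
    if restarts_per_day >= 1:
        alerts.append(f"{restarts_per_day} restarts/day: investigate crash loops")
    return alerts

# Healthy average CPU but a peak and memory breach: two alerts fire.
for alert in resource_alerts(avg_cpu=0.65, peak_cpu=0.92,
                             mem_fraction=0.85, restarts_per_day=0):
    print(alert)
```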

Model-specific performance

Compare performance across models. Identify underperformers and fine-tune resource allocation.

Panels: Response Times Table, Response Size, Request Breakdown

  • Metrics

    • Latency per model

    • Payload sizes

    • Request and error frequency

  • Insights

    • Identify slow models

    • Track regressions

    • Allocate resources per model
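Comparing models as described above amounts to aggregating latency and error frequency per model and sorting. This sketch is hypothetical; the record shape and model names are illustrative, not the dashboard's data format.

```python
# Hypothetical sketch: per-model summary for a response-times table.
from statistics import mean

def summarize(records):
    """records: iterable of (model, latency_ms, is_error) tuples."""
    per_model = {}
    for model, latency, is_error in records:
        entry = per_model.setdefault(model, {"latencies": [], "errors": 0, "requests": 0})
        entry["latencies"].append(latency)
        entry["errors"] += int(is_error)
        entry["requests"] += 1
    return {
        model: {
            "mean_latency_ms": mean(e["latencies"]),
            "error_rate": e["errors"] / e["requests"],
            "requests": e["requests"],
        }
        for model, e in per_model.items()
    }

records = [("model-a", 120, False), ("model-a", 140, False),
           ("model-b", 900, True), ("model-b", 700, False)]
summary = summarize(records)
# Sort slowest-first to surface underperformers.
for model, stats in sorted(summary.items(), key=lambda kv: -kv[1]["mean_latency_ms"]):
    print(model, stats)
```

Sorting by mean latency (or error rate) surfaces the slow or failing models that warrant extra resources or tuning.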

Next steps