Use execution monitoring dashboards

Use these dashboards to monitor workload execution, spot performance issues early, and optimize performance across your deployment.

Execution health and status

Panels: Active Executions, Success Rate, Execution Failures

Key metrics
- Execution status breakdown (Running, Pending, Preparing, Queued, Finishing)
- Success rate over 30-minute intervals
- Failure types (System vs User)

What to watch
- High Pending count: Points to resource bottlenecks
- Long time in Preparing: Points to image pull or volume mount delays
- Success rate <85%: Investigate immediately
- Growing Queued count: Points to scheduler issues

Targets
- Success rate: 95–100% (healthy), 85–95% (watch), <85% (investigate)
- Pending time: <5 minutes
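
The targets above can be turned into a simple check if you export the panel values. The following is a minimal sketch, not part of the dashboards themselves: the function names are hypothetical, and the handling of the exact boundary values (95% counts as healthy, 85% counts as watch) is an assumption, since the ranges above overlap at their edges.

```python
def success_rate(succeeded: int, failed: int) -> float:
    """Return the success rate as a percentage of completed executions."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no completed executions in this interval")
    return 100.0 * succeeded / total


def classify_success_rate(rate_percent: float) -> str:
    """Map a success rate to the bands listed under Targets.

    Assumption: 95.0 is treated as healthy and 85.0 as watch.
    """
    if rate_percent >= 95.0:
        return "healthy"
    if rate_percent >= 85.0:
        return "watch"
    return "investigate"


def pending_within_target(pending_minutes: float, target_minutes: float = 5.0) -> bool:
    """Return True when pending time is under the 5-minute target."""
    return pending_minutes < target_minutes
```

For example, an interval with 90 successful and 10 failed executions gives a 90% success rate, which falls in the watch band.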

Check workload status and failure trends in real time. Use this dashboard to detect execution problems and track system health.

Panels: Time to Available, Startup Duration, Time in Checkpoint

Phases
- NodeAssigned: Resource allocation
- ImagesPulled: Container setup
- VolumesMounted: Storage readiness
- FilesPrepared: Git/file staging
- ExecutionAvailable: Ready state

Optimization tips
- Delay in NodeAssigned: Scale node pools
- Bottlenecks in ImagesPulled: Use smaller images
- Slow VolumesMounted: Investigate storage
- Lag in FilesPrepared: Check repo size or network
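
To see which phase dominates startup time for a given execution, you can compare the time spent reaching each phase. The sketch below is illustrative only. It assumes that the phases are reached in the order listed above and that you have a timestamp for when each phase was reached; the function names and the input format are hypothetical, not a Domino API.

```python
from datetime import datetime
from typing import Dict, Tuple

# Phases in the order listed above (assumed to be sequential).
PHASES = [
    "NodeAssigned",
    "ImagesPulled",
    "VolumesMounted",
    "FilesPrepared",
    "ExecutionAvailable",
]

# Tips taken from the optimization list above.
TIPS = {
    "NodeAssigned": "Scale node pools",
    "ImagesPulled": "Use smaller images",
    "VolumesMounted": "Investigate storage",
    "FilesPrepared": "Check repo size or network",
}


def phase_durations(start: datetime, reached: Dict[str, datetime]) -> Dict[str, float]:
    """Return seconds spent getting to each phase from the previous one."""
    durations: Dict[str, float] = {}
    previous = start
    for phase in PHASES:
        timestamp = reached[phase]
        durations[phase] = (timestamp - previous).total_seconds()
        previous = timestamp
    return durations


def slowest_phase(durations: Dict[str, float]) -> Tuple[str, str]:
    """Return the phase with the longest duration and the matching tip."""
    phase = max(durations, key=lambda name: durations[name])
    return phase, TIPS.get(phase, "No tip listed")
```

If most of the startup time falls in ImagesPulled, for instance, `slowest_phase` returns that phase together with the tip to use smaller images.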

Track resource availability and scaling behavior. Use this dashboard to adjust pool sizes and avoid resource constraints.

Panels: Node Pool Size, Resource Utilization

Indicators
- Pool size trends
- Resource usage per pool
- Auto-scaling effectiveness

Strategies
- Overprovision spare nodes for faster startup
- Tune pool size for cost/performance balance
- Adjust scaling thresholds

Monitor how long container images take to download. Use this data to optimize images and reduce startup delays.

Panels: Compute Environment Pull Statistics

Metrics
- Pull count per image
- Average pull duration

Performance thresholds
- <30 sec: Optimal
- 30–120 sec: Investigate
- >120 sec: Action required

Improvements
- Use optimized, smaller container images
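
If you export pull records from the panel, the thresholds above can be applied per image. This is a rough sketch under stated assumptions: the input is a hypothetical list of (image name, pull duration in seconds) pairs, the function names are illustrative, and pulls longer than 120 seconds are treated as requiring action, with exactly 30 and 120 seconds falling in the investigate band.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_pull_seconds(pulls: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return the average pull duration in seconds for each image."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for image, seconds in pulls:
        totals[image] += seconds
        counts[image] += 1
    return {image: totals[image] / counts[image] for image in totals}


def classify_pull(seconds: float) -> str:
    """Map an average pull duration to the performance thresholds above."""
    if seconds < 30:
        return "optimal"
    if seconds <= 120:
        return "investigate"
    return "action required"
```

Classifying the per-image averages rather than individual pulls keeps a single slow pull from flagging an image that is usually fast.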

Work with model endpoint monitoring dashboards - Monitor the health, reliability, and performance of your model APIs to detect issues quickly and improve model serving.
Run data plane agent monitoring dashboards - Track the health and performance of Domino Data Plane Agents to monitor execution management and Kubernetes resource activity.