Use these dashboards to monitor workload execution, spot issues early, and optimize performance across your deployment.
Panels: Active Executions, Success Rate, Execution Failures
- Key metrics
  - Execution status breakdown (Running, Pending, Preparing, Queued, Finishing)
  - Success rate over 30-minute intervals
  - Failure types (System vs User)
- What to watch
  - High Pending: Resource bottlenecks
  - Long Preparing: Image pull or volume mount delays
  - Success rate <85%: Immediate investigation needed
  - Growing Queued: Scheduler issues
- Targets
  - Success rate: 95–100% (healthy), 85–95% (watch), <85% (investigate)
  - Pending time: <5 minutes
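The success rate and pending time targets can be expressed as a small check, for example when reviewing values exported from the dashboard. This is a minimal sketch, not part of the product: the function names are hypothetical, and because the 95% boundary appears in two bands, the sketch assumes 95% counts as healthy and 85% counts as watch.

```python
def classify_success_rate(rate_pct: float) -> str:
    """Map a success rate (in percent) to the target bands.

    Assumption: boundary values go to the better band,
    so 95.0 is "healthy" and 85.0 is "watch".
    """
    if not 0.0 <= rate_pct <= 100.0:
        raise ValueError("success rate must be between 0 and 100")
    if rate_pct >= 95.0:
        return "healthy"
    if rate_pct >= 85.0:
        return "watch"
    return "investigate"


def pending_within_target(pending_minutes: float, target_minutes: float = 5.0) -> bool:
    """Return True when pending time is under the 5-minute target."""
    return pending_minutes < target_minutes


if __name__ == "__main__":
    for rate in (98.2, 91.0, 80.5):
        print(f"{rate}% -> {classify_success_rate(rate)}")
    print("3 min pending within target:", pending_within_target(3))
```

A check like this only mirrors the bands listed here; tune the boundaries if your deployment uses different targets.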
Check workload status and failure trends in real time. Use this dashboard to detect execution problems and track system health.
Panels: Time to Available, Startup Duration, Time in Checkpoint
- Phases
  - NodeAssigned: Resource allocation
  - ImagesPulled: Container setup
  - VolumesMounted: Storage readiness
  - FilesPrepared: Git/file staging
  - ExecutionAvailable: Ready state
- Optimization tips
  - Delay in NodeAssigned: Scale node pools
  - Bottlenecks in ImagesPulled: Use smaller images
  - Slow VolumesMounted: Investigate storage
  - Lag in FilesPrepared: Check repo size or network
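One way to apply the optimization tips is to find the startup phase that took longest and look up the matching action. The sketch below assumes you have per-phase durations in seconds (for example, read off the Startup Duration panel); the function name and the input dictionary are hypothetical, and ExecutionAvailable is skipped because it marks the ready state rather than a tunable phase.

```python
# Phase name -> suggested action, taken from the optimization tips.
PHASE_TIPS = {
    "NodeAssigned": "Scale node pools",
    "ImagesPulled": "Use smaller images",
    "VolumesMounted": "Investigate storage",
    "FilesPrepared": "Check repo size or network",
}


def slowest_phase_tip(phase_seconds: dict) -> tuple:
    """Return (phase, tip) for the slowest phase that has a known tip."""
    known = {p: s for p, s in phase_seconds.items() if p in PHASE_TIPS}
    if not known:
        raise ValueError("no recognized startup phases in input")
    phase = max(known, key=known.get)
    return phase, PHASE_TIPS[phase]


if __name__ == "__main__":
    # Hypothetical durations for a single execution, in seconds.
    sample = {
        "NodeAssigned": 12.0,
        "ImagesPulled": 95.0,
        "VolumesMounted": 8.0,
        "FilesPrepared": 20.0,
    }
    phase, tip = slowest_phase_tip(sample)
    print(f"Slowest phase: {phase}. Suggested action: {tip}")
```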
Track resource availability and scaling behavior. Use this dashboard to adjust pool sizes and avoid resource constraints.
Panels: Node Pool Size, Resource Utilization
- Indicators
  - Pool size trends
  - Resource usage per pool
  - Auto-scaling effectiveness
- Strategies
  - Overprovision spare nodes for fast start
  - Tune pool size for cost/performance balance
  - Adjust scaling thresholds
Monitor how long container images take to download. Use this data to optimize images and reduce startup delays.
Panels: Compute Environment Pull Statistics
- Metrics
  - Pull count per image
  - Average pull duration
- Performance thresholds
  - <30 sec: Optimal
  - 30–120 sec: Investigate
  - >120 sec: Action required
- Improvements
  - Use optimized, smaller container images
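The pull-duration thresholds can be applied to a set of average pull times to list the images that need attention. This is a rough sketch under stated assumptions: the last band is read as more than 120 seconds, exactly 120 seconds falls in the "investigate" band, and the function names and image names are hypothetical.

```python
def classify_pull_duration(seconds: float) -> str:
    """Map an average image pull duration (seconds) to a threshold band."""
    if seconds < 0:
        raise ValueError("pull duration cannot be negative")
    if seconds < 30:
        return "optimal"
    if seconds <= 120:
        return "investigate"
    return "action required"


def images_needing_attention(avg_pull_seconds: dict) -> dict:
    """Return only the images whose band is not "optimal"."""
    return {
        image: classify_pull_duration(secs)
        for image, secs in avg_pull_seconds.items()
        if classify_pull_duration(secs) != "optimal"
    }


if __name__ == "__main__":
    # Hypothetical image names and average pull times in seconds.
    sample = {"base-py": 18.0, "gpu-train": 140.0, "r-notebook": 75.0}
    for image, band in images_needing_attention(sample).items():
        print(f"{image}: {band}")
```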
- Work with model endpoint monitoring dashboards: Monitor the health, reliability, and performance of your model APIs to detect issues quickly and improve model serving.
- Run data plane agent monitoring dashboards: Track the health and performance of Domino Data Plane Agents to monitor execution management and Kubernetes resource activity.