Use these dashboards to monitor the health and performance of the Domino Data Plane Agents that manage executions and Kubernetes resources.
Track agent status and communication health. Identify outages, disconnections, and service disruptions.
Panels: Agent Health, Data Plane State
- Indicators
  - State: Healthy, Unhealthy, Not Running
  - Connectivity to Nucleus, RabbitMQ
- What to watch
  - Disruptions, extended outages, unstable state changes
- Targets
  - Availability >99.9%
  - Health check <5 sec
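As a rough illustration of how the two targets above could be checked offline, the sketch below computes availability from a list of health-check samples. The `HealthSample` type, its field names, and the idea of treating availability as the fraction of samples in the Healthy state are assumptions made for this example; they are not part of the dashboards or of any Domino API.

```python
from dataclasses import dataclass

# Targets taken from the list above.
AVAILABILITY_TARGET = 0.999      # availability > 99.9%
HEALTH_CHECK_TARGET_SEC = 5.0    # health check < 5 sec


@dataclass
class HealthSample:
    """One hypothetical health-check observation for an agent."""
    state: str            # "Healthy", "Unhealthy", or "Not Running"
    check_seconds: float  # how long the health check took


def availability(samples):
    """Fraction of samples in the Healthy state (0.0 when there are none)."""
    if not samples:
        return 0.0
    healthy = sum(1 for s in samples if s.state == "Healthy")
    return healthy / len(samples)


def evaluate(samples):
    """Report whether the samples meet the availability and latency targets."""
    avail = availability(samples)
    slowest = max((s.check_seconds for s in samples), default=0.0)
    return {
        "availability": avail,
        "availability_ok": avail > AVAILABILITY_TARGET,
        "health_check_ok": slowest < HEALTH_CHECK_TARGET_SEC,
    }
```

Note that the availability target is a strict inequality, so a window with exactly 99.9% healthy samples does not pass.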
Measure how efficiently agents process execution-related messages. Detect delays and capacity issues.
Panels: Message Duration, Throughput, p95 Response Times
- Metrics
  - p95 roundtrip <2 sec
  - Average message duration <1 sec
- Bottlenecks
  - Long durations: API or processing delays
  - Low throughput: Capacity issues
- Message types
  - CREATE, UPDATE, DELETE, GET
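To make the two metrics above concrete, here is a minimal sketch that computes a p95 and an average from raw timings and compares them with the stated targets. It uses the nearest-rank percentile definition, which may differ slightly from the interpolation the dashboards use; the function names and input lists are hypothetical.

```python
import math

# Targets taken from the list above.
P95_ROUNDTRIP_TARGET_SEC = 2.0
AVG_DURATION_TARGET_SEC = 1.0


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_message_timings(roundtrips_sec, durations_sec):
    """Compare p95 roundtrip and average message duration with the targets."""
    p95 = percentile(roundtrips_sec, 95)
    avg = sum(durations_sec) / len(durations_sec)
    return {
        "p95_roundtrip": p95,
        "avg_duration": avg,
        "p95_ok": p95 < P95_ROUNDTRIP_TARGET_SEC,
        "avg_ok": avg < AVG_DURATION_TARGET_SEC,
    }
```

A failing `p95_ok` with a passing `avg_ok` suggests a small share of slow messages rather than a general slowdown.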
Monitor how the agent interacts with the Kubernetes API. Spot slow operations and API-related failures.
Panels: Kube API Request Duration
- Targets
  - p95 <500ms (general)
  - CREATE <2 sec, UPDATE/DELETE <1 sec, GET <200ms
- Operations
  - Pod/Service/Config/Namespace management
- Optimization
  - Monitor API performance and network latency
  - Adjust RBAC and resource specs
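The per-operation targets above can be expressed as a small lookup table. The sketch below flags operations whose observed duration is at or above its target and falls back to the general 500 ms target for any operation not listed. The input format (a mapping from operation name to observed seconds) is an assumption for illustration, not something the dashboards export.

```python
# Targets taken from the list above, in seconds.
GENERAL_TARGET_SEC = 0.5
OPERATION_TARGETS_SEC = {
    "CREATE": 2.0,
    "UPDATE": 1.0,
    "DELETE": 1.0,
    "GET": 0.2,
}


def slow_operations(observed_sec):
    """Return {operation: (observed, target)} for operations missing their target.

    Operations without a specific target use the general target.
    """
    slow = {}
    for operation, seconds in observed_sec.items():
        target = OPERATION_TARGETS_SEC.get(operation.upper(), GENERAL_TARGET_SEC)
        if seconds >= target:
            slow[operation] = (seconds, target)
    return slow
```

For example, `slow_operations({"CREATE": 1.5, "GET": 0.35})` reports only `GET`, because 1.5 seconds is within the CREATE target while 0.35 seconds exceeds the 200 ms GET target.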
Check agent CPU and memory usage. Identify overuse, leaks, or scaling limits.
Panels: Memory and CPU Use vs Requests and Limits
- Targets
  - Memory <80% of limit
  - CPU <70% average, <90% peak
- Optimization
  - Right-size resources
  - Watch for memory leaks
  - Set alerts on usage thresholds
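The usage thresholds above could drive a simple alert check like the one sketched here. It assumes CPU percentages are measured against the CPU limit (the source does not say whether the baseline is the request or the limit), and the function name, parameters, and alert keys are hypothetical.

```python
# Thresholds taken from the list above, in percent.
MEMORY_MAX_PCT = 80.0
CPU_AVG_MAX_PCT = 70.0
CPU_PEAK_MAX_PCT = 90.0


def resource_alerts(memory_bytes, memory_limit_bytes, cpu_cores_samples, cpu_limit_cores):
    """Return the names of the thresholds that the given usage violates.

    Assumption: CPU usage is expressed relative to the CPU limit.
    """
    alerts = []

    memory_pct = 100 * memory_bytes / memory_limit_bytes
    if memory_pct >= MEMORY_MAX_PCT:
        alerts.append("memory")

    if cpu_cores_samples:
        cpu_pcts = [100 * cores / cpu_limit_cores for cores in cpu_cores_samples]
        if sum(cpu_pcts) / len(cpu_pcts) >= CPU_AVG_MAX_PCT:
            alerts.append("cpu_average")
        if max(cpu_pcts) >= CPU_PEAK_MAX_PCT:
            alerts.append("cpu_peak")

    return alerts
```

A memory alert that recurs at steadily shorter intervals after each restart is one pattern worth checking for a leak.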
- Use execution monitoring dashboards: Track workload performance, identify issues early, and optimize execution across your deployment.
- Work with model endpoint monitoring dashboards: Monitor the health, reliability, and performance of your model APIs to detect issues quickly and improve model serving.