Monitoring the health of Domino also requires monitoring the underlying infrastructure, like Kubernetes. Health issues in the Kubernetes cluster can sometimes be automatically remediated, but monitoring at each layer of the platform can help identify and remediate issues faster.
The following tables start from Domino and work down the layers of the stack to include the underlying infrastructure.
Each table also includes descriptions with considerations. However, every cluster is different, so you must monitor and adjust according to your environment. For example, consider how long it might take you to respond when storage usage is increasing: you might want to warn at 50% and escalate at 80%.
Also, remember that Kubernetes manages itself so momentary bursts can cause alerts that might not be a concern.
Domino recommends tracking these metrics in order of priority:
Metric | Suggested threshold | Description |
---|---|---|
Latency to the Domino API server | 1000ms | Measures the time to receive a response to a request to the Domino API server. If the response time is too high, the system may be unhealthy and user experience might be impacted. This can be measured by timing calls to the Domino application. |
Dispatcher pod availability from metrics server | | If the number of pods in the dispatcher deployment falls below the expected count, the service is degraded and must be investigated. |
Frontend pod availability from metrics server | | If the number of pods in the frontend deployment falls below the expected count, the service is degraded and must be investigated. |
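As an illustration, the latency check in the table above could be scripted as follows. This is a minimal sketch: the URL, the `/health` path, and the helper names are assumptions, not a documented Domino endpoint; only the 1000 ms threshold comes from the table.

```python
import time
import urllib.request

API_LATENCY_THRESHOLD_MS = 1000  # suggested threshold from the table above


def classify_latency(elapsed_ms: float,
                     threshold_ms: float = API_LATENCY_THRESHOLD_MS) -> str:
    """Classify a measured response time against the suggested threshold."""
    return "unhealthy" if elapsed_ms > threshold_ms else "healthy"


def measure_latency_ms(url: str, timeout_s: float = 5.0) -> float:
    """Time a single GET request to the Domino application (hypothetical URL)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s):
        pass
    return (time.monotonic() - start) * 1000.0
```

In practice you would feed `classify_latency(measure_latency_ms("https://domino.example.com/health"))` into your alerting pipeline rather than checking a single sample.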
Many of the metrics and suggested alert thresholds that follow duplicate the overall Kubernetes metrics. However, monitoring specific events for some of the core services makes it possible to link an issue to a particular Domino core service and to confirm the health of Domino itself.
To learn more about what each service is responsible for, see Architecture.
Nucleus frontend
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | Pods are being evicted. |
Frontend Pods not ready | >0 for 5 minutes | Pods are in a not-ready state. |
High GC CPU usage | >15% for 15 minutes | The Nucleus frontend is a Java-based application, so in addition to the standard container metrics, monitor JVM health. To do this, use the high garbage collection CPU usage metric. |
Nucleus dispatcher
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | Pods are being evicted. |
Pods not ready | >0 for 5 minutes | Dispatcher pods are in a not-ready state. |
High GC CPU usage | >15% for 15 minutes | Dispatcher, much like the frontend, is a Java-based application. Therefore, you must use the Garbage Collection metric to observe the Java application’s health. |
MongoDB
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes. |
Pods not ready | >0 for 5 minutes | See notes on Kubernetes. |
ReplicaSet degraded | >80% for 5 minutes | See notes on Kubernetes. |
MongoDB high CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | High CPU usage of MongoDB might indicate that it is not behaving as expected. |
High PVC usage | >75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical) | MongoDB uses persistent storage and, as the database grows, this will fill the storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | MongoDB both fills storage and continually reads and writes to disk. High inode usage can lead to performance degradation. |
mongo.mongod.queryexecutor.scannedPerSecond / mongo.mongod.document.returnedPerSecond | >1 | A ratio greater than 1 indicates that MongoDB is scanning more documents than it returns, which suggests an issue with indexing on the collection. |
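The scanned-to-returned ratio in the last row can be computed directly from the two metric values. A minimal sketch (function names are illustrative, not part of the Domino or MongoDB API):

```python
def scan_to_return_ratio(scanned_per_second: float,
                         returned_per_second: float) -> float:
    """Ratio of documents MongoDB scans to documents it returns.

    A ratio above 1 means MongoDB examines more documents than it hands
    back, which usually points to a missing or ineffective index.
    """
    if returned_per_second == 0:
        # Scanning without returning anything is the worst case.
        return float("inf") if scanned_per_second > 0 else 0.0
    return scanned_per_second / returned_per_second


def index_looks_healthy(scanned_per_second: float,
                        returned_per_second: float) -> bool:
    """Apply the table's suggested threshold of 1."""
    return scan_to_return_ratio(scanned_per_second, returned_per_second) <= 1.0
```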
Git
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes. |
Pods not ready | >0 for 5 minutes | See notes on Kubernetes. |
ReplicaSet degraded | >80% for 5 minutes | See notes on Kubernetes. |
Git high CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | High CPU usage of Git can be an indicator that it is not behaving as expected. |
High PVC usage | >75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical) | Git uses persistent storage and, as the number of commits grows, this will fill up the storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | Git both fills storage and continually reads and writes to disk. High inode usage can lead to performance degradation. |
Git error rates | >1 count for 5 minutes | The Git service backs repository operations in Domino. A sustained error rate indicates that these operations are failing and must be investigated. |
Docker registry
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes. |
Not ready | >0 for 5 minutes | See notes on Kubernetes. |
ReplicaSet degraded | >80% for 5 minutes | See notes on Kubernetes. |
High CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | If you are using the deployed Docker registry, you must monitor its CPU usage because significantly high usage for prolonged times can be an indicator that it is not behaving as expected. |
Docker registry error rates | >1 count for 15 minutes | The Docker registry is exposed as an HTTPS/HTTP service. Connection failures to the service indicate there might be an issue with images being stored or pulled. |
Docker registry high latency | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | High latency to the service impacts pull and push times for images and leads to a degradation of service. |
RabbitMQ
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes. |
Pods not ready | >0 for 5 minutes | See notes on Kubernetes. |
ReplicaSet degraded | >80% for 5 minutes | See notes on Kubernetes. |
High pod memory usage | >75% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | See notes on Kubernetes. |
High queue rate | >1000 count for 10 minutes | RabbitMQ must continuously deliver messages. An increased queue count indicates that it cannot deliver messages and that a service is not behaving as expected. |
RabbitMQ low memory | >90 for 10 minutes | RabbitMQ is a memory-intensive application, so its memory usage is constantly high. A drop in memory usage might indicate that it is not functioning as expected. |
Available TCP sockets | >90% for 10 minutes | RabbitMQ is the message distributor for all services in Domino and must be connected to all of them to communicate. If the number of free TCP sockets is significantly low, it might struggle to create those connections. |
High PVC usage | >75% count for 15 minutes (Warn) >85% count for 15 minutes (Critical) | Rabbit uses persistent storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | It will both fill up space and continually read and write to disk. High inode usage can lead to performance degradation. |
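The "for 10 minutes" qualifier on the queue-rate threshold means the condition must hold across a whole trailing window, not just for a single sample. A hedged sketch of that evaluation (the `(timestamp, depth)` sampling format is an assumption, not RabbitMQ's API):

```python
from typing import Sequence, Tuple


def queue_alert(samples: Sequence[Tuple[float, int]],
                threshold: int = 1000,
                window_s: float = 600.0) -> bool:
    """Return True if queue depth exceeded `threshold` for the entire
    trailing `window_s` seconds.

    `samples` are (unix_timestamp, queue_depth) pairs in ascending
    time order; a single momentary burst does not trigger the alert.
    """
    if not samples:
        return False
    now = samples[-1][0]
    window = [depth for ts, depth in samples if ts >= now - window_s]
    return bool(window) and all(depth > threshold for depth in window)
```

The same windowed pattern applies to every threshold in these tables written as "for N minutes": remember from the introduction that momentary bursts are normal Kubernetes behavior.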
Execution layer
Metric | Suggested threshold | Description |
---|---|---|
Model pods scheduled | >0 for 15 minutes | If model pods remain in the scheduled state for a significant amount of time, it might indicate that they are failing to start and must be investigated. |
Zombie Runs | >0 for 15 minutes (Warn) | When a Run completes, its pod must shut itself down. Pods that continue to run as zombies add excess workload to your cluster. Investigate to identify why the Run did not terminate on completion. |
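A zombie Run, as described above, is simply a Run that Domino reports as complete while its execution pod is still alive. A minimal sketch of that cross-check (the `Run` shape here is illustrative, not Domino's actual API):

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Run:
    run_id: str
    completed: bool    # Domino reports the Run as finished
    pod_running: bool  # the execution pod still exists in Kubernetes


def find_zombie_runs(runs: Iterable[Run]) -> List[str]:
    """Return the IDs of Runs whose pods never shut down after completion.

    These pods consume cluster capacity for no work and should be
    investigated per the table above.
    """
    return [r.run_id for r in runs if r.completed and r.pod_running]
```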
Observe and monitor the following for each node in your Kubernetes cluster.
Metric | Suggested threshold | Description |
---|---|---|
Average CPU usage | >80% for 15 minutes | The average node CPU usage must not be significantly high for long periods. |
Average memory usage | >90% for 15 minutes | Average node memory usage must not be significantly high for long periods. |
Disk usage | FS >85% for 15 minutes | Local disk is used both by the underlying operating system and by Kubernetes and the containers running on it. Usage might spike while many containers are running and then dip; this is normal behavior, but usage should not be consistently high. |
Node not ready status | >0 for 30 minutes | If a node is in a not-ready state, it cannot accept containers, so your Kubernetes platform will not be running at full capacity. |
Shared file system sizes | FS>75% (Warn) FS >90% (Critical) | Domino uses shared file systems for backing a number of its persistent volumes. These must be monitored and increased as workloads and volumes grow. |
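The Warn/Critical bands for shared file system sizes can be expressed as a small classifier. A sketch using the table's 75%/90% thresholds (the function name is illustrative):

```python
def fs_usage_severity(used_fraction: float,
                      warn: float = 0.75,
                      critical: float = 0.90) -> str:
    """Map a shared-file-system fill level to the table's bands:
    FS >75% warns, FS >90% is critical, anything else is OK."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    if used_fraction > critical:
        return "critical"
    if used_fraction > warn:
        return "warn"
    return "ok"
```

Escalating from Warn to Critical this way gives you lead time to grow the volumes before workloads are impacted, which is the point of the two-tier threshold.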
Observe the following settings across the entire Kubernetes platform. If the thresholds are hit, it might be an early warning sign of an issue on the platform. This can lead to an issue with Domino for users.
Metric | Suggested threshold | Description |
---|---|---|
Failed pods count | Dependent on the cluster (some failed pods in a development environment might be expected). | Observe the number of pods in a failed state. Depending on the type of environment you are in and what else is running on your platform, it might be normal to have a few failed pods. Configure the threshold accordingly. |
Containers running out of disk space | FS >75% for 5 minutes (Warn) FS >90% for 5 minutes (Critical) | In addition to the underlying operating system disk filling up, the containers running on your platform also use disk, both ephemeral and persistent. Significant increases in usage, or sustained high usage, may impact service. |
Container memory usage | >90% for 5 minutes (Warn) >95% for 5 minutes (Critical) | Containers consume memory from the underlying operating system. They are typically configured with requests and limits to prevent one container from consuming all the memory from the system. As workloads grow, the limits might be reached and need to be adjusted. If they aren’t set, you want to ensure that containers are not consuming too much node memory. |
Container CPU usage | >90% for 5 minutes (Warn) >95% for 5 minutes (Critical) | Container CPU works on the same basis as container memory, described previously. |
Pods unschedulable | >0 for 7 minutes | If a pod can’t be scheduled, there might be issues on the cluster. You might not have enough nodes so there isn’t enough capacity. There might also be an issue with a specific type of node or a constraint not being met for the pod deployment, such as storage availability. Check these because it can be an early warning sign of an issue. |
Pods not ready | >0 for 10 minutes | Pods must be ready and available in a reasonable time frame. If they are taking significant time to become ready, this can be a sign that something is not running as expected. |
OOM Killed events | >0 for 15 minutes | If a pod consumes too much memory, surpassing its quotas and limits or significantly impacting the node, an out-of-memory (OOM) kill can terminate it. If this happens, review the application to see if its memory must be adjusted, and adjust quotas and limits accordingly. It might also be an underlying issue with the application. |
Pod evictions | >0 for 10 minutes | These occur when a node is resource-starved. It might be Kubernetes rebalancing itself and scaling up nodes or shifting the workload to another node that is not at capacity. However, it might be an indication that you must manually scale your cluster. |
ReplicaSet count | >0 pods missing from ReplicaSet | A ReplicaSet is a Kubernetes controller that ensures a specified number of pod replicas is running. If fewer replica pods are running than the ReplicaSet specifies, something is likely wrong. |
ImagePullBackOff | >10 count for 5 minutes | All pods run an image that comes from a registry, either directly from an upstream Domino registry or from some form of proxied internal registry. If you are getting ImagePullBackOff events, pods cannot pull their images; investigate registry availability and connectivity. |
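Several thresholds in this table take the form "count over a window", such as the ImagePullBackOff row. A minimal sketch of that style of check (the `(timestamp, reason)` event format is an assumption; in practice these reasons come from Kubernetes pod events):

```python
from collections import Counter
from typing import Iterable, Tuple


def image_pull_backoff_alert(events: Iterable[Tuple[float, str]],
                             now: float,
                             threshold: int = 10,
                             window_s: float = 300.0) -> bool:
    """Count ImagePullBackOff events in the trailing window (5 minutes,
    per the table above) and alert when the count exceeds the threshold."""
    recent = Counter(reason for ts, reason in events if ts >= now - window_s)
    return recent["ImagePullBackOff"] > threshold
```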