Monitor Domino from the top down, starting at the application layer and working down to the infrastructure layer. While knowing that a particular node is about to run out of disk space, or that a certain service is using more CPU than expected, is certainly useful, it is often harder to know in isolation what constitutes, for example, excessive memory usage or pathological network bandwidth consumption. By starting at the application layer and working down, you can build a picture of what normal looks like for your Domino deployment and set alerts accordingly.
The following tables start from the application layer and work down through the Kubernetes cluster layer to the underlying infrastructure layer. Each table also includes descriptions with key considerations.
However, everyone’s cluster is different so you must monitor and adjust for your environment. For example, consider how long it might take you to respond when storage size is increasing. You may want to set this value to 50% and escalate at 80%.
Or consider the sort of latency that your users are willing to accept when performing actions in the UI and alert on requests that take significantly longer than that.
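As an illustrative sketch of that kind of user-facing latency alerting (the function names and sample data are invented for this example; the 1000 ms cutoff mirrors the API latency threshold suggested later in this page), you could compute a high percentile over collected response times and compare it to the threshold:

```python
def p95(samples_ms):
    """Simple percentile estimate: the value at the 95% rank of the
    sorted samples (response times in milliseconds)."""
    ordered = sorted(samples_ms)
    idx = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[idx]

def latency_alert(samples_ms, threshold_ms=1000):
    """Alert when the p95 response time exceeds the threshold."""
    return p95(samples_ms) > threshold_ms

# Mostly fast requests pass; a heavier slow tail trips the alert.
fast_tail = [120] * 95 + [1500] * 5   # p95 is 120 ms
slow_tail = [120] * 94 + [1500] * 6   # p95 is 1500 ms
```

Alerting on a percentile rather than the average keeps a handful of slow outliers from paging you while still catching a genuine slowdown.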
Also, remember that Kubernetes manages itself, so momentary bursts of activity can trigger alerts that might not be a concern on their own unless other key indicators are also triggered. For example, high MongoDB CPU usage might be normal during a backup operation, but if it is sustained for a long period alongside many collstats queries for a collection, it might indicate an issue with that collection's indexes.
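The "above threshold for N minutes" pattern used throughout the tables below can be sketched as a small evaluator over timestamped samples. This is illustrative only, not part of Domino; a real monitoring system (Prometheus, Datadog, and so on) expresses the same idea in its own alert rules:

```python
def sustained_breach(samples, threshold, duration_s):
    """samples: (timestamp_s, value) pairs, oldest first.
    Fire only when value has stayed above threshold continuously
    for at least duration_s, up to the latest sample."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts      # breach window opens here
        else:
            breach_start = None        # a single dip resets the window
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s

# 95% CPU for 20 straight minutes fires; a spiky burst does not.
steady = [(t * 60, 95) for t in range(21)]
burst = [(0, 95), (60, 40), (120, 95), (180, 40)]
```

The reset-on-dip behavior is exactly what suppresses the momentary bursts described above: only a sustained breach fires.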
Metric | Suggested threshold | Description |
---|---|---|
Latency to the Domino API server | 1000ms | Measures the time to receive a response to a request to the Domino API server. If the response time is too high, this suggests that the system is unhealthy and that user experience might be impacted. This can be measured by calls to the Domino application at a path of |
Dispatcher pod availability from metrics server | | If the number of pods in the |
Frontend pod availability from metrics server | | If the number of pods in the |
Metric | Suggested threshold | Description |
---|---|---|
Model pods scheduled | >0 for 15 minutes | If model pods remain scheduled but not running for a significant amount of time, it might indicate that they are failing to start and must be investigated. |
Zombie Runs | >0 for 15 minutes (Warn) | When a Run completes, its pod must shut itself down. If a pod continues to run as a zombie, this can lead to excess workload on your cluster. Investigate to identify why the Run did not terminate upon completion. |
Failed workload runs | rate > 5 for 15 minutes | If a significant number of runs are failing, this might indicate an issue with the underlying infrastructure or the workload itself. Care should be taken to differentiate between failures due to bad user code and failures due to infrastructure issues. |
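A sketch of the failure-rate differentiation described above (the record shape and the "user"/"infra" category labels are assumptions for illustration; how you classify a given failure depends on your environment):

```python
from collections import Counter

def infra_failure_alert(failures, now_s, window_s=900, max_failures=5):
    """failures: (timestamp_s, category) records; category is "user" for
    bad user code or "infra" for node, image, or storage problems.
    Fire only on infrastructure failures, so user-code errors
    don't page the platform team."""
    recent = [cat for ts, cat in failures if now_s - ts <= window_s]
    return Counter(recent)["infra"] > max_failures

# Six infra failures inside the 15-minute window fire the alert;
# the same volume of user-code failures does not.
infra_storm = [(i * 100, "infra") for i in range(6)]
user_noise = [(i * 100, "user") for i in range(6)]
```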
Many of the metrics and suggested alert thresholds that follow duplicate the overall Kubernetes metrics. However, to attribute an issue to a specific Domino core service and ensure the health of Domino itself, it's worth monitoring these events for each of the core services.
To learn more about what each service is responsible for, see Domino architecture.
Nucleus frontend
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes under Kubernetes |
Frontend Pods not ready | >0 for 5 minutes | See notes under Kubernetes |
High GC CPU usage | >15% for 15 minutes | The Nucleus frontend is a Java application, so in addition to the standard container metrics, monitor JVM health. To do this, use the garbage collection CPU usage metric. |
Nucleus dispatcher
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes under Kubernetes |
Pods not ready | >0 for 5 minutes | See notes under Kubernetes |
High GC CPU usage | >15% for 15 minutes | The dispatcher, much like the frontend, is a Java-based application, so use the garbage collection metric to observe JVM health. |
MongoDB
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes under Kubernetes |
Pods not ready | >0 for 5 minutes | See notes under Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes |
MongoDB high CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | High CPU usage of Mongo might indicate that it is not behaving as expected |
High PVC usage | >75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical) | Mongo uses persistent storage and, as the database grows, this will fill the storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | As well as filling space, Mongo continually reads and writes files on disk. High inode usage can lead to a degradation of performance and, if inodes are exhausted, prevent new files from being created. |
mongo.mongod.queryexecutor.scannedPerSecond / mongo.mongod.document.returnedPerSecond | <1 | A ratio significantly greater than 1 means queries are scanning far more documents than they return, which typically indicates a missing or inefficient index on the collection. |
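As an illustration of the scanned-to-returned ratio, this sketch derives it from two counter snapshots taken some interval apart (the dict field names are placeholders mirroring the per-second metrics above, not a real MongoDB client API):

```python
def scanned_returned_ratio(prev, curr):
    """Rate ratio between two counter snapshots: documents scanned
    per document returned over the sampling interval."""
    scanned = curr["scanned"] - prev["scanned"]
    returned = curr["returned"] - prev["returned"]
    if returned == 0:
        return float("inf") if scanned else 0.0
    return scanned / returned

# A well-indexed workload scans about one document per document returned;
# a much higher ratio suggests collection scans.
prev = {"scanned": 10_000, "returned": 10_000}
good = {"scanned": 11_000, "returned": 11_000}
bad = {"scanned": 61_000, "returned": 11_000}
```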
Git
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes under Kubernetes |
Pods not ready | >0 for 5 minutes | See notes under Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes |
Git high CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | High CPU usage of Git can be an indicator that it is not behaving as expected |
High PVC usage | >75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical) | Git uses persistent storage and, as the number of commits grows, this will fill the storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | As well as filling space, Git continually creates and removes files on disk. High inode usage can lead to a degradation of performance and, if inodes are exhausted, prevent new files from being created. |
Git Error rates | >1 count for 5 minutes | Git performs functions such as init, download, and upload as part of its service. Monitoring for errors from these events is an indicator that users are experiencing issues with version control. |
Docker registry
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes under Kubernetes |
Not ready | >0 for 5 minutes | See notes under Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes |
High CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | If using the deployed Docker registry, monitor its CPU usage, because significantly high usage for prolonged periods can be an indicator that it is not behaving as expected. |
Docker registry error rates | >1 unit for 15 minutes | The Docker registry is exposed as an https/http service. Connection failures to the service indicate there might be an issue with images being stored or pulled. |
Docker registry high latency | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | High latency to the service will impact pull and push times for images and lead to a degradation of service. |
RabbitMQ
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes under Kubernetes |
Pods not ready | >0 for 5 minutes | See notes under Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes |
High pod memory usage | >75% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | See notes under Kubernetes |
High queue rate | >1000 count for 10 minutes | RabbitMQ must be continuously delivering messages. An increasing queue count indicates that it cannot deliver messages and that a service is not behaving as expected. |
RabbitMQ low memory | >90 for 10 minutes | RabbitMQ is a memory-intensive application, so its memory usage will be constantly high. A drop in memory usage might indicate that it is not functioning as expected. |
Available TCP sockets | >90% for 10 minutes | RabbitMQ is the message distributor for all services in Domino and must be able to connect to each of them. If the number of free TCP sockets is significantly low, it might struggle to create those connections. |
High PVC usage | >75% count for 15 minutes (Warn) >85% count for 15 minutes (Critical) | Rabbit uses persistent storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | As well as filling space, RabbitMQ continually creates and removes files on disk. High inode usage can lead to a degradation of performance and, if inodes are exhausted, prevent new files from being created. |
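The PVC inode alerts used by several of the services above can be sampled on Linux with `os.statvfs`. A minimal sketch, where the mount path is a placeholder and the thresholds mirror the Warn/Critical levels suggested in the tables:

```python
import os

def inode_usage_percent(path):
    """Percentage of inodes in use on the filesystem containing path."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems don't report inode counts
        return 0.0
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files

def inode_alert(path, warn=80.0, critical=90.0):
    """Map inode usage onto the suggested Warn/Critical levels."""
    pct = inode_usage_percent(path)
    if pct >= critical:
        return "critical"
    if pct >= warn:
        return "warn"
    return "ok"
```

In practice you would point this at the PVC mount path inside the pod (for example, the volume mount used by MongoDB, Git, or RabbitMQ) rather than at the node's root filesystem.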
Observe the following settings across the entire Kubernetes platform. If these thresholds are hit, it might be an early warning sign of a platform issue that will affect Domino users.
Metric | Suggested threshold | Description |
---|---|---|
Failed pods count | Dependent on cluster (some failed pods in a development environment might be expected). | Observe the number of pods in a failed state. Depending on the type of environment you are in and what else runs on your platform it might be normal to have a few failed pods. Configure the threshold accordingly. |
Containers running out of disk space | FS >75% for 5 minutes (Warn) FS >90% for 5 minutes (Critical) | As well as the underlying operating system disk filling, the containers running on your platform use disk, both ephemeral and persistent. Significant increases, or prolonged periods of high usage, may impact service. |
Container memory usage | >90% for 5 minutes (Warn) >95% for 5 minutes (Critical) | Containers consume memory from the underlying operating system. They are typically configured with requests and limits to prevent one container from consuming all the memory from the system. As workloads grow, the limits might be reached and need to be adjusted. If they aren’t set, you want to ensure that containers are not consuming too much node memory. |
Container CPU usage | >90% for 5 minutes (Warn) >95% for 5 minutes (Critical) | Container CPU works under the same basis as container memory previously described. |
Pods unschedulable | >0 for 7 minutes | If a pod can’t be scheduled, there might be issues on the cluster. You might not have enough nodes so there isn’t enough capacity. There might also be an issue with a specific type of node or a constraint not being met for the pod deployment, such as storage availability. Check these because it can be an early warning sign of an issue. |
Pods not ready | >0 for 10 minutes | Pods must be ready and available in a reasonable time frame. If they are taking significant time to become ready, this can be a sign that something is not running as expected. |
OOM Killed events | >0 for 15 minutes | If a pod consumes too much memory, surpassing its quotas and limits or significantly impacting the node, the kernel's out-of-memory (OOM) killer can terminate it. If this happens, review the application to see whether its memory settings must be adjusted, and adjust quotas and limits accordingly. It might also indicate an underlying issue with the application. |
Pod evictions | >0 for 10 minutes | These occur when a node is resource starved. It might be Kubernetes rebalancing itself and scaling up nodes or shifting workload to another node that is not at capacity. However, it might be an indication that you must manually scale your cluster. |
Replicaset count | >0 pods missing from replicaset | A ReplicaSet specifies a set number of pod replicas that must be running. If fewer replica pods are running than the ReplicaSet specifies, something is likely wrong. |
ImagePullBackOff | >10 count for 5 minutes | All pods run an image that comes from a registry, either directly from an upstream Domino registry or from some form of proxied internal registry. If you are getting ImagePullBackOff failures, this might indicate a network issue connecting to the registry, a problem with the registry itself, or an authentication problem with the registry. |
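The OOM Killed events above surface in pod status as a container termination reason of `OOMKilled`. A minimal sketch that scans a pod object shaped like `kubectl get pod -o json` output:

```python
def oom_killed_containers(pod):
    """Return the names of containers in a pod whose last termination
    was an OOM kill. `pod` is a dict shaped like the JSON from
    `kubectl get pod -o json`."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(cs["name"])
    return names

# Minimal sample status for illustration (exit code 137 = SIGKILL).
pod = {"status": {"containerStatuses": [
    {"name": "app",
     "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
    {"name": "sidecar", "lastState": {}},
]}}
```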
Observe and monitor the following for each node in your Kubernetes cluster.
Metric | Suggested threshold | Description |
---|---|---|
Average CPU usage | >80% for 15 minutes | Average node CPU usage must not be significantly high for long periods of time |
Average memory usage | >90% for 15 minutes | Average node memory usage must not be significantly high for long periods of time |
Disk usage | FS >85% for 15 minutes | Local disk can be used for both the underlying operating system functionality as well as Kubernetes and the containers running on it. It might spike during high runs of containers and dip. This is normal behavior, but it should not be consistently high. |
Node not ready status | >0 for 30 minutes | If a node is in a not-ready state, it cannot accept containers, so your Kubernetes platform will not run at full capacity. |
Shared file system sizes | FS>75% (Warn) FS >90% (Critical) | Domino uses shared file systems for backing a number of its persistent volumes. These must be monitored and increased as workloads and volumes grow. |
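As a minimal sketch of the disk and shared file system checks above (the mount path is a placeholder; the defaults mirror the FS >75% Warn and >90% Critical levels), the Python standard library can sample file-system usage:

```python
import shutil

def fs_usage_percent(path):
    """Percentage of space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def fs_alert(path, warn=75.0, critical=90.0):
    """Map usage onto the FS >75% (Warn) / >90% (Critical) levels."""
    pct = fs_usage_percent(path)
    if pct >= critical:
        return "critical"
    if pct >= warn:
        return "warn"
    return "ok"
```

Pointed at a shared file system mount, this is the kind of check whose Warn level you would tune to leave enough lead time to grow the volume, per the guidance at the top of this page.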