Monitoring Domino involves tracking several key application metrics. These metrics reveal the health of the application and can provide advance warning of any issues or failures of Domino components.
This list is not exhaustive.
Domino deployments include a pre-configured Grafana instance that you can use for monitoring. You can also use other application monitoring tools in addition to Grafana to track these metrics.
Domino runs in Kubernetes, which is an orchestration framework for containerized applications. In this model, the following are the distinct layers with their own relevant metrics:
- Domino application: This is the top layer, representing Domino application components running in containers that are deployed and managed by Kubernetes. The content in this Admin guide focuses on operations in this layer.
- Kubernetes cluster: This is the Kubernetes software-defined hardware abstraction and orchestration system that manages the deployment and lifecycle of Domino application components. Cluster operations are handled a layer below Domino, but do have to consider the Domino architecture and cluster requirements. For guidance about general cluster administration, see the official Kubernetes documentation.
- Host infrastructure: This is the bottom layer that represents the virtual or physical host machines that are doing work as nodes in the Kubernetes cluster. Information technology owners of the infrastructure are responsible for operations in this layer, including management of compute and storage resources, as well as operating system patching. Domino does not have any unique or unusual requirements in this layer.
The following tables start from the underlying infrastructure and build up in layers to the Domino core services. Each table also includes descriptions with considerations. However, every cluster is different, so you must monitor and adjust the thresholds for your environment. For example, consider how long it might take you to respond when storage usage is increasing. You might want to warn at 50% and escalate at 80%.
Also, remember that Kubernetes manages itself, so momentary bursts can trigger alerts that might not be a concern.
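The idea of warning at one level, escalating at another, and ignoring momentary bursts can be sketched as follows. This is a minimal illustration, not part of Domino or Grafana: the `Rule` class, the `evaluate` function, and the sample format are all hypothetical names chosen for this example. A level fires only when every sample in the trailing window exceeds it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical alert rule: fire when a value stays above a level for a duration."""
    name: str
    warn: float          # for example 50.0 (% of storage used)
    critical: float      # for example 80.0
    for_seconds: float   # how long the breach must last before alerting


def evaluate(rule: Rule, samples: Sequence[Tuple[float, float]]) -> Optional[str]:
    """Return "critical", "warn", or None.

    `samples` holds (timestamp_seconds, value) pairs, oldest first.
    """
    if not samples:
        return None
    newest = samples[-1][0]
    # Do not alert until the history covers the whole window.
    if newest - samples[0][0] < rule.for_seconds:
        return None
    window = [value for ts, value in samples if newest - ts <= rule.for_seconds]
    # Every sample in the window must breach the level, so a short burst is ignored.
    if all(value > rule.critical for value in window):
        return "critical"
    if all(value > rule.warn for value in window):
        return "warn"
    return None
```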
Domino recommends tracking these metrics in priority order:
Metric | Suggested threshold | Description |
---|---|---|
Latency to | 1000ms | Measures the time to receive a response to a request to the Domino API server. If the response time is too high, this suggests that the system is unhealthy and that user experience might be impacted. This can be measured by calls to the Domino application at a path of |
Dispatcher pod availability from metrics server | | If the number of pods in the |
Frontend pod availability from metrics server | | If the number of pods in the |
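
A latency check of this kind could be sketched as below. The helper names (`measure_latency_ms`, `http_probe`, `is_too_slow`) are hypothetical, and the URL is a placeholder that you supply; this sketch does not assume any particular Domino path. Only the 1000 ms threshold comes from the table.

```python
import time
import urllib.request
from typing import Callable

LATENCY_THRESHOLD_MS = 1000.0  # suggested threshold from the latency row


def measure_latency_ms(probe: Callable[[], object]) -> float:
    """Time a single call to `probe` and return the elapsed milliseconds."""
    start = time.perf_counter()
    probe()
    return (time.perf_counter() - start) * 1000.0


def http_probe(url: str, timeout_s: float = 5.0) -> Callable[[], object]:
    """Build a probe that performs one GET against `url` (a placeholder, not a real Domino path)."""
    def probe() -> int:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            response.read()  # include the body transfer in the measurement
            return response.status
    return probe


def is_too_slow(latency_ms: float, threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """Return True when the measured latency exceeds the threshold."""
    return latency_ms > threshold_ms
```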
Observe and monitor the following for each node in your Kubernetes cluster.
Metric | Suggested threshold | Description |
---|---|---|
Average CPU usage | >80% for 15 minutes | Average node CPU usage must not be significantly high for long periods of time |
Average memory usage | >90% for 15 minutes | Average node memory usage must not be significantly high for long periods of time |
Disk usage | FS >85% for 15 minutes | Local disk is used by the underlying operating system as well as by Kubernetes and the containers running on it. Usage might spike while many containers are running and then dip. This is normal behavior, but usage should not be consistently high. |
Node not ready status | >0 for 30 minutes | If a node is in a not ready state, it cannot accept containers, so your Kubernetes platform will not run at full capacity. |
Shared file system sizes | FS >75% (Warn) FS >90% (Critical) | Domino uses shared file systems to back a number of its persistent volumes. Monitor these and increase them as workloads and volumes grow. |
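
As a rough illustration of the file system thresholds, the following sketch reads usage for a mounted path with the standard library and classifies it against the warn and critical levels from the shared file system row. The function names and the mount path you pass in are assumptions made for this example. In practice a node exporter or your monitoring agent would normally collect this value.

```python
import shutil
from typing import Optional


def percent_used(path: str) -> float:
    """Return the percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def classify(percent: float, warn: float = 75.0, critical: float = 90.0) -> Optional[str]:
    """Map a usage percentage to "critical", "warn", or None."""
    if percent > critical:
        return "critical"
    if percent > warn:
        return "warn"
    return None
```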
Observe the following metrics across the entire Kubernetes platform. If a threshold is hit, it might be an early warning sign of a platform issue that can affect Domino users.
Metric | Suggested threshold | Description |
---|---|---|
Failed pods count | Dependent on cluster (some failed pods in a development environment might be expected). | Observe the number of pods in a failed state. Depending on the type of environment you are in and what else runs on your platform it might be normal to have a few failed pods. Configure the threshold accordingly. |
Containers running out of disk space | FS >75% for 5 minutes (Warn) FS >90% for 5 minutes (Critical) | As well as the underlying operating system, the containers running on your platform use disk, both ephemeral and persistent. Significant increases in usage, or sustained periods of high usage, might impact service. |
Container memory usage | >90% for 5 minutes (Warn) >95% for 5 minutes (Critical) | Containers consume memory from the underlying operating system. They are typically configured with requests and limits to prevent one container from consuming all the memory from the system. As workloads grow, the limits might be reached and need to be adjusted. If they aren’t set, you want to ensure that containers are not consuming too much node memory. |
Container CPU usage | >90% for 5 minutes (Warn) >95% for 5 minutes (Critical) | Container CPU works on the same basis as container memory, described in the previous row. |
Pods unschedulable | >0 for 7 minutes | If a pod can’t be scheduled, there might be issues on the cluster. You might not have enough nodes so there isn’t enough capacity. There might also be an issue with a specific type of node or a constraint not being met for the pod deployment, such as storage availability. Check these because it can be an early warning sign of an issue. |
Pods not ready | >0 for 10 minutes | Pods must be ready and available in a reasonable time frame. If they are taking significant time to become ready, this can be a sign that something is not running as expected. |
OOM Killed events | >0 for 15 minutes | If a pod consumes too much memory and surpasses its quotas and limits, or significantly impacts the node, an out of memory error can kill it. If this happens, review the application to see if the memory must be adjusted. Also, adjust quotas and limits accordingly. It also might be an underlying issue with the application. |
Pod evictions | >0 for 10 minutes | These occur when a node is resource starved. It might be Kubernetes rebalancing itself and scaling up nodes or shifting workload to another node that is not at capacity. However, it might be an indication that you must manually scale your cluster. |
ReplicaSet count | >0 pods missing from ReplicaSet | A ReplicaSet is a Kubernetes workload resource that specifies a set number of pods that must be running. If the number of running replica pods is less than the count that the ReplicaSet specifies, something is likely wrong. |
ImagePullBackOff | >10 count for 5 minutes | All pods run an image that comes from a registry, either directly from an upstream Domino registry or from some form of proxied internal registry. If you are getting ImagePullBackOff failures, this might indicate a network issue connecting to the registry, a problem with the registry itself, or an authentication problem. |
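
Several of these counts can be derived from pod status objects. The sketch below assumes its input is the `items` list parsed from the JSON output of `kubectl get pods --all-namespaces -o json`. The `summarize_pods` function and its counter keys are hypothetical names for this example. It only reads standard pod status fields (`phase`, `reason`, and the container `state` and `lastState`).

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize_pods(pods: Iterable[Dict[str, Any]]) -> Counter:
    """Count failed, evicted, ImagePullBackOff, and OOMKilled occurrences."""
    counts: Counter = Counter()
    for pod in pods:
        status = pod.get("status", {})
        if status.get("phase") == "Failed":
            counts["failed"] += 1
        # Evicted pods are reported as Failed with reason "Evicted".
        if status.get("reason") == "Evicted":
            counts["evicted"] += 1
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "ImagePullBackOff":
                counts["image_pull_backoff"] += 1
            terminated = container.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                counts["oom_killed"] += 1
    return counts
```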
Many of the metrics and suggested alert thresholds that follow are duplicates of the overall Kubernetes metrics. However, to trace an issue to a specific Domino core service and to ensure the health of Domino itself, it’s worth monitoring specific events for some of the core services.
To learn more about what each service is responsible for, see Architecture.
Nucleus frontend
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | Pods being evicted |
Frontend Pods not ready | >0 for 5 minutes | Pods in a not ready state |
High GC CPU usage | >15% for 15 minutes | The nucleus frontend is a Java application, so in addition to the standard container metrics it’s important to monitor JVM health. To do this, use the high garbage collection CPU usage metric. |
Nucleus dispatcher
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | Pods being evicted |
Pods not ready | >0 for 5 minutes | Dispatcher pods in a not ready state |
High GC CPU usage | >15% for 15 minutes | Dispatcher, much like the frontend, is a Java-based application, so you must use the Garbage Collection metric to observe the Java application health. |
MongoDB
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes |
Pods not ready | >0 for 5 minutes | See notes on Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes on Kubernetes |
MongoDB high CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | High CPU usage of Mongo might indicate that it is not behaving as expected |
High PVC usage | >75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical) | Mongo uses persistent storage and, as the database grows, this will fill the storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | As well as filling space, it will continually read and write to disk. High inode usage can lead to a degradation of performance. |
mongo.mongod.queryexecutor.scannedPerSecond / mongo.mongod.document.returnedPerSecond | <1 | A value >1 indicates there’s an issue with indexing on the collection. |
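
The ratio in the last row is simple arithmetic on two rates. A minimal sketch, with hypothetical function names and inputs that stand in for the two metric values sampled at the same moment:

```python
from typing import Optional


def scan_to_return_ratio(scanned_per_second: float, returned_per_second: float) -> Optional[float]:
    """Return scanned/returned, or None when no documents are being returned."""
    if returned_per_second <= 0:
        return None  # ratio is undefined without returned documents
    return scanned_per_second / returned_per_second


def indexing_suspect(ratio: Optional[float], limit: float = 1.0) -> bool:
    """Flag a possible indexing issue when the ratio is above the limit."""
    return ratio is not None and ratio > limit
```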
Git
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes |
Pods not ready | >0 for 5 minutes | See notes on Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes on Kubernetes |
Git high CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | High CPU usage of Git can be an indicator that it is not behaving as expected |
High PVC usage | >75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical) | Git uses persistent storage and, as the number of commits grows, this will fill the storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | As well as filling space, it will continually read and write to disk. High inode usage can lead to a degradation of performance. |
Git Error rates | >1 count for 5 minutes | Git performs functions such as init, download, and upload as part of its service. Monitoring for errors from these events is an indicator that users are experiencing issues with version control. |
Docker registry
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes |
Pods not ready | >0 for 5 minutes | See notes on Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes on Kubernetes |
High CPU usage | >85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical) | If using the deployed Docker registry, you must monitor its CPU usage because significant high usage for prolonged times can be an indicator that it is not behaving as expected. |
Docker registry error rates | >1 unit for 15 minutes | The Docker registry is exposed as an HTTP/HTTPS service. Connection failures to the service indicate there might be an issue with images being stored or pulled. |
Docker registry high latency | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | High latency to the service will impact pull and push times for images and lead to a degradation of service. |
Evicted pods | >1 count for 5 minutes | See notes on Kubernetes |
RabbitMQ
Metric | Suggested threshold | Description |
---|---|---|
Evicted pods | >0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical) | See notes on Kubernetes |
Pods not ready | >0 for 5 minutes | See notes on Kubernetes |
Replica Set degraded | >80% for 5 minutes | See notes on Kubernetes |
High pod memory usage | >75% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | See notes on Kubernetes |
High queue rate | >1000 count for 10 minutes | Rabbit must be continuously sending messages. An increased queue count indicates it cannot send messages and a service is not behaving as expected. |
RabbitMQ low memory | >90 for 10 minutes | Rabbit is a high-memory consuming application. Its memory usage will be constantly high. A drop in memory usage might indicate it’s not functioning as expected. |
Available TCP sockets | >90% for 10 minutes | Rabbit is the message distributor for all services in Domino. It must be connected to all the services to be able to communicate. If the number of free TCP sockets is significantly low, it might struggle to create those connections. |
High PVC usage | >75% count for 15 minutes (Warn) >85% count for 15 minutes (Critical) | Rabbit uses persistent storage. This might have to be increased over time. |
High PVC inode usage | >80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical) | As well as filling space, it will continually read and write to disk. High inode usage can lead to a degradation of performance. |
Metric | Suggested threshold | Description |
---|---|---|
Model pods scheduled | >0 for 15 minutes | If model pods remain in a scheduled state for a significant amount of time, it might indicate that they are failing to start. Investigate the cause. |
Zombie Runs | >0 for 15 minutes (Warn) | When a Run completes, its pod must shut itself down. If the pod continues to run as a zombie pod, this can lead to excess workload on your cluster. Investigate to identify why the Run did not terminate upon completion. |