Metrics to monitor

Monitoring and alerting best practices

Monitor Domino from the top down, starting at the application layer and working down to the infrastructure layer. While knowing that a particular node is about to run out of disk space, or that a certain service is using more CPU than expected, is certainly useful, it is often harder to know in isolation what constitutes, for example, excessive memory usage or pathological network bandwidth consumption. By starting at the application layer and working down, you can build a picture of what normal looks like for your Domino deployment and set alerts accordingly.

The following tables start from the application layer and work down through the Kubernetes cluster layer to the underlying infrastructure layer. Each table also includes descriptions with key considerations.

However, every cluster is different, so you must monitor and adjust for your environment. For example, consider how long it might take you to respond when storage usage is increasing. You might want to warn at 50% and escalate at 80%.

Or consider the sort of latency that your users are willing to accept when performing actions in the UI and alert on requests that take significantly longer than that.

Also, remember that Kubernetes manages itself, so momentary bursts of activity can trigger alerts that, on their own, might not be a concern unless other key indicators are also triggered. For example, high MongoDB CPU usage might be normal during a backup operation, but if it is sustained for a long period alongside many collstats queries against a collection, it might indicate an issue with the indexes on that collection.
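The "above threshold for N minutes" semantics used throughout the tables below can be sketched as a small helper. This is a minimal illustration, not part of any monitoring tool: it only fires when every sample in the trailing window breaches the threshold, so a momentary burst does not trigger the alert.

```python
def sustained_breach(samples, threshold, duration_s):
    """Return True only if every sample in the trailing `duration_s`
    window exceeds `threshold`. `samples` is a list of (timestamp, value)
    pairs, oldest first. A single spike inside the window returns False."""
    if not samples:
        return False
    window_start = samples[-1][0] - duration_s
    window = [v for t, v in samples if t >= window_start]
    return bool(window) and all(v > threshold for v in window)

# Example: CPU% sampled once a minute for 10 minutes. The early spike at
# minute 1 is ignored; only the trailing 5-minute window is evaluated.
cpu = [(60 * i, v) for i, v in enumerate([40, 95, 50, 91, 92, 93, 94, 95, 96, 97])]
sustained_breach(cpu, 90, 300)  # True: the last 5 minutes all exceed 90%
```

Most monitoring systems (Prometheus `for:` clauses, Datadog monitor windows, and so on) implement this behavior natively; the sketch just makes the rule explicit.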

Application

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Latency to /health | 1000ms | Measures the time to receive a response to a request to the Domino API server. If the response time is too high, this suggests that the system is unhealthy and that user experience might be impacted. This can be measured by calls to the Domino application at a path of /health. |
| Dispatcher pod availability from metrics server | nucleus-dispatcher pods available = 0 for >10 minutes | If the number of pods in the nucleus-dispatcher deployment is 0 for greater than 10 minutes, it's an indication of critical issues that Domino will not automatically recover from, and functionality will be degraded. |
| Frontend pod availability from metrics server | nucleus-front-end pods available < 2 for >10 minutes | If the number of pods in the nucleus-front-end deployment is less than two for greater than 10 minutes, it's an indication of critical issues that Domino will not automatically recover from, and functionality will be degraded. |
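The /health latency check above can be probed with a short script. This is a minimal sketch using only the Python standard library; the base URL is a hypothetical placeholder, and the 1000 ms threshold comes from the table.

```python
import time
import urllib.request

def health_latency_ms(base_url, timeout_s=5):
    """Measure wall-clock latency of a GET to the /health endpoint.
    Returns (status_code, elapsed_ms); raises on connection failure."""
    start = time.monotonic()
    with urllib.request.urlopen(base_url.rstrip("/") + "/health",
                                timeout=timeout_s) as resp:
        status = resp.status
    return status, (time.monotonic() - start) * 1000.0

def is_unhealthy(status, latency_ms, threshold_ms=1000.0):
    """Apply the suggested threshold: non-200 or >1000 ms is unhealthy."""
    return status != 200 or latency_ms > threshold_ms

# Hypothetical usage against your own deployment URL:
# status, ms = health_latency_ms("https://domino.example.com")
# if is_unhealthy(status, ms):
#     ...  # raise an alert
```

In practice you would run a probe like this from your monitoring system on a schedule and feed the measurements into the sustained-threshold alerting described earlier.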

Workload executions

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Model pods scheduled | >0 for 15 minutes | If model pods remain in a scheduled (pending) state for a significant amount of time, it might indicate that they are failing to start and must be investigated. |
| Zombie Runs | >0 for 15 minutes (Warn) | When a Run completes, its pod must shut itself down. If a pod continues to run as a zombie, this can lead to excess workload on your cluster. Investigate this to identify why the run did not terminate upon completion. |
| Failed workload runs | rate > 5 for 15 minutes | If a significant number of runs are failing, this might indicate an issue with the underlying infrastructure or the workload itself. Take care to differentiate between failures due to bad user code and failures due to infrastructure issues. |
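The "rate > 5 for 15 minutes" rule for failed runs can be expressed as a count over a trailing window. This is an illustrative sketch; the event timestamps would come from whatever source your monitoring system collects run failures from.

```python
def failure_rate_breached(failure_timestamps, now, window_s=900, max_failures=5):
    """True when more than `max_failures` failed runs landed in the
    trailing `window_s` seconds (the 'rate > 5 for 15 minutes' rule).
    Timestamps are seconds since any fixed epoch."""
    recent = [t for t in failure_timestamps if now - window_s <= t <= now]
    return len(recent) > max_failures

# Six failures within the last 15 minutes breaches the threshold:
failure_rate_breached([0, 100, 200, 300, 400, 500], now=600)  # True
```

Distinguishing infrastructure failures from bad user code typically requires tagging each failure event with its cause before counting, so that only infrastructure-caused failures feed this alert.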

Domino services

Many of the metrics and suggested alert thresholds that follow duplicate the overall Kubernetes metrics. However, to attribute an issue to a specific Domino core service and ensure the health of Domino itself, it's worth monitoring specific events for some of the core services.

To learn more about what each service is responsible for, see Domino architecture.

Nucleus frontend

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Evicted pods | >0 count for 5 minutes (Warn); >5 count for 5 minutes (Critical) | See notes under Kubernetes. |
| Frontend pods not ready | >0 for 5 minutes | See notes under Kubernetes. |
| High GC CPU usage | >15% for 15 minutes | The nucleus frontend is a Java application, so in addition to the standard container metrics, monitor JVM health. To do this, use the garbage collection CPU usage metric. |

Nucleus dispatcher

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Evicted pods | >0 count for 5 minutes (Warn); >5 count for 5 minutes (Critical) | See notes under Kubernetes. |
| Pods not ready | >0 for 5 minutes | See notes under Kubernetes. |
| High GC CPU usage | >15% for 15 minutes | The dispatcher, much like the frontend, is a Java-based application, so use the garbage collection metric to observe the Java application's health. |

MongoDB

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Evicted pods | >0 count for 5 minutes (Warn); >5 count for 5 minutes (Critical) | See notes under Kubernetes. |
| Pods not ready | >0 for 5 minutes | See notes under Kubernetes. |
| Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes. |
| MongoDB high CPU usage | >85% count for 10 minutes (Warn); >100% count for 10 minutes (Critical) | High CPU usage of MongoDB might indicate that it is not behaving as expected. |
| High PVC usage | >75% count for 15 minutes (Warn); >80% count for 15 minutes (Critical) | MongoDB uses persistent storage and, as the database grows, this will fill the storage. This might have to be increased over time. |
| High PVC inode usage | >80% count for 15 minutes (Warn); >90% count for 15 minutes (Critical) | As well as filling space, MongoDB continually reads from and writes to disk. High inode usage can lead to a degradation of performance. |
| mongo.mongod.queryexecutor.scannedPerSecond / mongo.mongod.document.returnedPerSecond | <1 | A sustained value greater than 1 means queries scan more documents than they return, which indicates an issue with indexing on the collection. |
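The scanned-to-returned ratio in the last row is a simple quotient of two rates. This sketch shows the arithmetic and the edge cases; the input numbers here are made-up samples, not real measurements.

```python
def scan_ratio(scanned_per_s, returned_per_s):
    """Ratio of documents scanned to documents returned per second.
    A sustained ratio above 1 suggests queries are scanning more
    documents than they return, i.e. a collection may lack an index."""
    if returned_per_s == 0:
        # Scanning without returning anything is the worst case;
        # no activity at all is treated as healthy.
        return float("inf") if scanned_per_s > 0 else 0.0
    return scanned_per_s / returned_per_s

scan_ratio(5000, 100)  # 50.0: investigate indexes on the hot collection
scan_ratio(90, 100)    # 0.9: healthy, queries are well indexed
```

When this ratio alerts, correlating it with MongoDB's per-collection statistics (for example via collstats, as mentioned earlier) helps identify which collection is missing an index.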

Git

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Evicted pods | >0 count for 5 minutes (Warn); >5 count for 5 minutes (Critical) | See notes under Kubernetes. |
| Pods not ready | >0 for 5 minutes | See notes under Kubernetes. |
| Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes. |
| Git high CPU usage | >85% count for 10 minutes (Warn); >100% count for 10 minutes (Critical) | High CPU usage of Git can be an indicator that it is not behaving as expected. |
| High PVC usage | >75% count for 15 minutes (Warn); >80% count for 15 minutes (Critical) | Git uses persistent storage and, as the number of commits grows, this will fill the storage. This might have to be increased over time. |
| High PVC inode usage | >80% count for 15 minutes (Warn); >90% count for 15 minutes (Critical) | As well as filling space, Git continually reads from and writes to disk. High inode usage can lead to a degradation of performance. |
| Git error rates | >1 count for 5 minutes | Git performs functions such as init, download, and upload as part of its service. Errors from these events are an indicator that users are experiencing issues with version control. |

Docker registry

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Evicted pods | >0 count for 5 minutes (Warn); >5 count for 5 minutes (Critical) | See notes under Kubernetes. |
| Pods not ready | >0 for 5 minutes | See notes under Kubernetes. |
| Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes. |
| High CPU usage | >85% count for 10 minutes (Warn); >100% count for 10 minutes (Critical) | If using the deployed Docker registry, monitor its CPU usage, because sustained high usage can be an indicator that it is not behaving as expected. |
| Docker registry error rates | >1 unit for 15 minutes | The Docker registry is exposed as an HTTP/HTTPS service. Connection failures to the service indicate there might be an issue with images being stored or pulled. |
| Docker registry high latency | >80% count for 15 minutes (Warn); >90% count for 15 minutes (Critical) | High latency to the service will impact pull and push times for images and lead to a degradation of service. |


RabbitMQ

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Evicted pods | >0 count for 5 minutes (Warn); >5 count for 5 minutes (Critical) | See notes under Kubernetes. |
| Pods not ready | >0 for 5 minutes | See notes under Kubernetes. |
| Replica Set degraded | >80% for 5 minutes | See notes under Kubernetes. |
| High pod memory usage | >75% count for 15 minutes (Warn); >90% count for 15 minutes (Critical) | See notes under Kubernetes. |
| High queue rate | >1000 count for 10 minutes | RabbitMQ should be continuously delivering messages. A growing queue count indicates that it cannot deliver messages and that a service is not behaving as expected. |
| RabbitMQ low memory | >90 for 10 minutes | RabbitMQ is a memory-intensive application, and its memory usage will be consistently high. A drop in usage might indicate it is not functioning as expected. |
| Available TCP sockets | >90% for 10 minutes | RabbitMQ is the message distributor for all services in Domino and must be able to connect to all of them. If the number of free TCP sockets is significantly low, it might struggle to create those connections. |
| High PVC usage | >75% count for 15 minutes (Warn); >85% count for 15 minutes (Critical) | RabbitMQ uses persistent storage. This might have to be increased over time. |
| High PVC inode usage | >80% count for 15 minutes (Warn); >90% count for 15 minutes (Critical) | As well as filling space, RabbitMQ continually reads from and writes to disk. High inode usage can lead to a degradation of performance. |
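The queue-depth threshold can be evaluated from the RabbitMQ management API, whose GET /api/queues endpoint returns one JSON object per queue including a `messages` count (ready plus unacknowledged). The helper below works on that payload shape; the queue names and counts in the example are invented for illustration.

```python
def total_queued(queues):
    """Sum message counts across queues. `queues` is the JSON list
    returned by the RabbitMQ management API's GET /api/queues endpoint;
    each item's `messages` field counts ready + unacknowledged messages."""
    return sum(q.get("messages", 0) for q in queues)

# Sample payload with hypothetical queue names:
payload = [
    {"name": "run-events", "messages": 12},
    {"name": "notifications", "messages": 1500},
]
total_queued(payload)  # 1512, above the suggested 1000 threshold
```

Fetching the payload requires credentials for the management plugin; in a real deployment, the scrape would typically be done by an exporter rather than ad hoc code like this.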

Kubernetes

Observe the following settings across the entire Kubernetes platform. If the thresholds are hit, it might be an early warning sign of an issue on the platform. This can lead to an issue with Domino for users.

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Failed pods count | Dependent on cluster (some failed pods in a development environment might be expected) | Observe the number of pods in a failed state. Depending on the type of environment you are in and what else runs on your platform, it might be normal to have a few failed pods. Configure the threshold accordingly. |
| Containers running out of disk space | FS >75% for 5 minutes (Warn); FS >90% for 5 minutes (Critical) | As well as the underlying operating system disk, the containers running on your platform use disk, both ephemeral and persistent. Significant increases, or sustained periods of high usage, may impact service. |
| Container memory usage | >90% for 5 minutes (Warn); >95% for 5 minutes (Critical) | Containers consume memory from the underlying operating system. They are typically configured with requests and limits to prevent one container from consuming all the memory on the system. As workloads grow, the limits might be reached and need to be adjusted. If they aren't set, you want to ensure that containers are not consuming too much node memory. |
| Container CPU usage | >90% for 5 minutes (Warn); >95% for 5 minutes (Critical) | Container CPU works on the same basis as container memory, described previously. |
| Pods unschedulable | >0 for 7 minutes | If a pod can't be scheduled, there might be issues on the cluster. You might not have enough nodes, so there isn't enough capacity. There might also be an issue with a specific type of node, or a constraint not being met for the pod deployment, such as storage availability. Check these because it can be an early warning sign of an issue. |
| Pods not ready | >0 for 10 minutes | Pods must be ready and available in a reasonable time frame. If they are taking significant time to become ready, this can be a sign that something is not running as expected. |
| OOM Killed events | >0 for 15 minutes | If a pod consumes too much memory, surpassing its quotas and limits or significantly impacting the node, it can be killed with an out-of-memory (OOM) error. If this happens, review the application to see whether its memory allocation must be adjusted, and adjust quotas and limits accordingly. It might also be an underlying issue with the application. |
| Pod evictions | >0 for 10 minutes | These occur when a node is starved of resources. It might be Kubernetes rebalancing itself, scaling up nodes, or shifting workload to another node that is not at capacity. However, it might be an indication that you must manually scale your cluster. |
| Replicaset count | >0 pods missing from replicaset | ReplicaSets specify a set number of pods that must be running. If the number of replica pods is less than the count specified by the ReplicaSet, something is likely wrong. |
| ImagePullBackOff | >10 count for 5 minutes | All pods run an image that comes from a registry, either directly from an upstream Domino registry or from some form of proxied internal registry. ImagePullBackOff failures might indicate a network issue connecting to the registry, a problem with the registry itself, or an authentication problem with the registry. |
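The "pods not ready" check above can be computed from pod status conditions, where a pod is ready when its `Ready` condition has status `"True"`. The sketch below operates on objects shaped like the items of `kubectl get pods -o json`; the pod names in the example are hypothetical.

```python
def not_ready_pods(pods):
    """Given pod objects shaped like `kubectl get pods -o json` items,
    return the names of pods whose Ready condition is not 'True'."""
    names = []
    for pod in pods:
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                    for c in conditions)
        if not ready:
            names.append(pod.get("metadata", {}).get("name", "<unknown>"))
    return names

pods = [
    {"metadata": {"name": "nucleus-frontend-abc"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "mongodb-0"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]
not_ready_pods(pods)  # ["mongodb-0"]
```

To raise the alert only after the suggested 10 minutes, pair a check like this with the sustained-threshold logic shown earlier rather than alerting on a single observation.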

Infrastructure

Observe and monitor the following for each node in your Kubernetes cluster.

| Metric | Suggested threshold | Description |
| --- | --- | --- |
| Average CPU usage | >80% for 15 minutes | Average node CPU usage must not be significantly high for long periods of time. |
| Average memory usage | >90% for 15 minutes | Average node memory usage must not be significantly high for long periods of time. |
| Disk usage | FS >85% for 15 minutes | Local disk is used both by the underlying operating system and by Kubernetes and the containers running on it. Usage might spike during bursts of container activity and then dip; this is normal behavior, but it should not be consistently high. |
| Node not ready status | >0 for 30 minutes | If a node is in a not-ready state, it cannot accept containers, so your Kubernetes platform will not run at full capacity. |
| Shared file system sizes | FS >75% (Warn); FS >90% (Critical) | Domino uses shared file systems to back a number of its persistent volumes. Monitor these and increase them as workloads and volumes grow. |
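The warn/critical file system thresholds above reduce to a simple classification of used-versus-total bytes. This is an illustrative helper; the 75%/90% defaults mirror the suggested thresholds in the table, and the byte counts in the example are arbitrary.

```python
def fs_alert_level(used_bytes, total_bytes, warn=0.75, critical=0.90):
    """Classify file system usage against the suggested thresholds
    (FS >75% warn, FS >90% critical)."""
    pct = used_bytes / total_bytes
    if pct > critical:
        return "critical"
    if pct > warn:
        return "warn"
    return "ok"

fs_alert_level(80, 100)   # "warn"
fs_alert_level(95, 100)   # "critical"
```

On a live node, `shutil.disk_usage(path)` from the Python standard library returns the total and used byte counts this helper expects.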