Drift Detection

The fleetcommand-agent image that runs operator jobs creates definitions for an additional custom resource, HelmRelease, which is managed by the platform operator. These resources map one-to-one to the Helm releases that Domino manages and deploys to the cluster. They follow a reconciliation loop separate from that of the Domino resource and are continuously evaluated for drift between the deployed manifest of the Helm release and the live state of the cluster.

Although the operator is capable of correcting drift, this behavior is not yet enabled globally, nor is it configurable per service through the Domino resource. By default, the operator only warns of drift, reporting it directly in the conditions of the HelmRelease resource. A future release will surface drift correction as a configurable option.

Using ddlctl is the best way to inspect the state of HelmRelease resources in your cluster:

# Get all HelmRelease resources in the cluster across namespaces
ddlctl get helmrelease --all

# Get all HelmRelease resources in the domino-platform namespace
ddlctl get helmrelease --namespace domino-platform

# Get all HelmRelease resources in the cluster that are marked as Stalled
ddlctl get helmrelease --all --status stalled=true

A HelmRelease is marked as Stalled when the operator detects that:

  • the Helm release has drifted from the desired state,

  • the Helm release is in a Failed state,

  • the Helm release is locked in a pending state,

  • the Helm release was deleted, or

  • the latest Helm revision does not match the desired revision of the current Domino generation.
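Since several different situations lead to a Stalled release, it can help to read the condition message for a single resource. The following is a minimal sketch using kubectl, assuming the HelmRelease exposes its conditions under status.conditions with a condition type named Stalled; both the field path and the helper name stalled_message are assumptions, not something confirmed by this page.

```shell
# Print the message of the "Stalled" condition for one HelmRelease.
# Assumption: conditions live under .status.conditions and the type is "Stalled".
# stalled_message is a hypothetical helper name.
# Usage: stalled_message <release-name> <namespace>
stalled_message() {
  kubectl get helmrelease "$1" \
    --namespace "$2" \
    --output jsonpath='{.status.conditions[?(@.type=="Stalled")].message}'
}
```

For example, stalled_message nucleus domino-platform would query the nucleus release. If the command prints nothing, the assumed condition is either absent or stored under a different path.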

To get all HelmRelease resources in the cluster that are marked as Ready, run the following:

ddlctl get helmrelease --all --status ready=true

HelmRelease resources are deployed with a default reconciliation interval of 5 minutes. If a release gets out of sync with the cluster, it will not necessarily register as drift immediately; it is picked up on the next reconciliation.

To force a reconciliation without waiting for the next interval, use ddlctl:

ddlctl reconcile helmrelease nucleus -n domino-platform
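After forcing a reconciliation, a script may want to block until the release settles. The sketch below uses kubectl wait with a timeout equal to one default interval (5 minutes). It assumes the HelmRelease exposes a condition of type Ready, which is inferred from the READY column and the ready=true status filter rather than stated explicitly; wait_for_release is a hypothetical helper name.

```shell
# Block until the HelmRelease reports Ready, or give up after one default
# reconciliation interval (5 minutes = 300 seconds).
# Assumption: the resource has a condition of type "Ready".
# wait_for_release is a hypothetical helper name.
# Usage: wait_for_release <release-name> <namespace>
wait_for_release() {
  kubectl wait "helmrelease/$1" \
    --namespace "$2" \
    --for=condition=Ready \
    --timeout=300s
}
```

For example, wait_for_release nucleus domino-platform returns a non-zero exit status if the release is still not Ready when the timeout expires, which makes it usable as a gate in automation.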

Investigating drift

There are a few ways to discover what has drifted on a HelmRelease resource.

ddlctl offers a subcommand for inspecting the diff of a Helm release against the live state of the cluster:

ddlctl diff helmrelease nucleus -n domino-platform

If the resource has drifted, you can expect to see something similar to the following:

NAME                	READY	REASON 	MESSAGE                                           	DRIFT DETECTION MODE	SUSPENDED
domino-data-importer	False	Drifted	cluster has drifted from desired helmrelease state	warn                	false

For resources that are in sync, you can expect to see something more like the following:

NAME   	READY	REASON	MESSAGE                                  	DRIFT DETECTION MODE	SUSPENDED
nucleus	True 	InSync	helmrelease is in sync with cluster state	warn                	false
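When scripting around this output, the table can be filtered with standard tools. The following is a minimal sketch, assuming the output is tab-separated with the column order shown above (NAME first, REASON third) and a single header row; drifted_releases is a hypothetical helper name.

```shell
# Read ddlctl's tabular output on stdin and print the NAME of every row
# whose REASON column is "Drifted".
# Assumptions: tab-separated columns, one header row, NAME is column 1,
# REASON is column 3. drifted_releases is a hypothetical helper name.
drifted_releases() {
  awk -F '\t' 'NR > 1 {
    # Strip the padding spaces around the fields we compare or print.
    gsub(/^ +| +$/, "", $1)
    gsub(/^ +| +$/, "", $3)
    if ($3 == "Drifted") print $1
  }'
}
```

It could be used by piping the diff output into it, for example ddlctl diff helmrelease nucleus -n domino-platform | drifted_releases. Rows that are in sync produce no output.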

The operator also writes information on the nature of drift to events, which can be inspected with kubectl describe, for example:

# Inspect the events of a Helm release
kubectl describe helmrelease nucleus -n domino-platform

The Warning event reports the resource where drift was detected and the type of drift, and includes the JSON patch (in full or in part) that would be applied if the HelmRelease resource were set to correct mode rather than warn.
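To see only the Warning events without the rest of the describe output, the events can be queried directly. This is a sketch using the standard event field selectors in kubectl; it assumes the events are recorded against an object of kind HelmRelease in the same namespace as the release, and drift_events is a hypothetical helper name.

```shell
# List only Warning events recorded against one HelmRelease.
# Assumption: the operator records events with involvedObject.kind=HelmRelease.
# drift_events is a hypothetical helper name.
# Usage: drift_events <release-name> <namespace>
drift_events() {
  kubectl get events \
    --namespace "$2" \
    --field-selector "involvedObject.kind=HelmRelease,involvedObject.name=$1,type=Warning"
}
```

For example, drift_events nucleus domino-platform lists the Warning events for the nucleus release.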

Note
Because Kubernetes events have a character limit, the JSON patch is truncated to a maximum of 500 characters. The full patch can be found in the operator logs, which can also be accessed with ddlctl by running ddlctl logs operator.
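The operator logs cover every release, so narrowing them to one release makes the full patch easier to find. The sketch below simply pipes the logs through grep; it assumes that the relevant log lines mention the release name, which is not confirmed here, and operator_logs_for is a hypothetical helper name.

```shell
# Print only the operator log lines that mention a given release name.
# Assumption: log lines about a release contain its name verbatim.
# operator_logs_for is a hypothetical helper name.
# Usage: operator_logs_for <release-name>
operator_logs_for() {
  # -F treats the name as a fixed string rather than a regular expression.
  ddlctl logs operator | grep -F -- "$1"
}
```

For example, operator_logs_for nucleus prints the matching lines and exits with a non-zero status if nothing matches.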