Executor maintenance




Overview

Domino maintains a pool of machines called executors, organized into hardware tiers, for use in Domino Runs. Domino system administrators can monitor and act on executors from the Admin interface. This document describes how to monitor and work with the fleet of executors in your Domino deployment.




Monitoring and working with executors

Domino system administrators can click Dispatcher from the Admin home to view and manage executors in the deployment. This interface updates live and shows all currently configured hardware tiers and executors, with usage and current state for each.

[Screenshot: Dispatcher interface]




Executor state

The Dispatcher interface shows flags for the following executor states.

  • Available

    Available executors are ready for use, and may be assigned Runs, unless they have been manually put in Maintenance Mode.

  • Unusable

    Domino will not assign Runs to executors in an Unusable state, but these executors still count against the total number of allowed executors in their hardware tiers. This means that a large number of executors in an Unusable state can fill the capacity of a hardware tier and limit its availability. If this occurs, system administrators should either manually terminate the executors, wait for them to be automatically terminated, or put them into Maintenance Mode to attempt to repair them.

  • Maintenance Mode

    You will see a flag indicating executors that have been put into Maintenance Mode, plus an optional comment from the admin who toggled Maintenance Mode on for that executor.




Taking actions on executors

The rightmost column for each executor in the Dispatcher interface is an Actions link that opens an interface where Domino system administrators can take the following actions.

[Screenshot: Actions interface for an executor]




Health checks

Domino executors are subject to several periodic health checks. Two of these checks test whether the Dispatcher can connect to vital services running on the executor.

There is also a configurable health check for disk space. If com.cerebro.domino.executor.minUsableSpaceInGB is set to a non-zero value, disk space health checks run, and an executor fails the check if its available disk space is lower than the minimum of the following two configuration options. Note that the first option is expressed in bytes and the second in gigabytes.


Namespace: common
Key: com.cerebro.domino.executor.diskSpaceRunsGarbageCollectorFreeSpaceLimit
Value: number of bytes
Default: 50000000000 (this is ~50GB in bytes)

Namespace: common
Key: com.cerebro.domino.executor.minUsableSpaceInGB
Value: number of gigabytes
Default: 0

If this option is set to its default value of 0, disk space health
checks will be disabled and will not run or impact your executors.

These options can be configured in the Central Config.
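
The disk space rule above can be sketched as follows. This is a minimal illustration, not Domino's implementation: the function names are hypothetical, and the conversion of 1 GB to 10^9 bytes is an assumption based on the "~50GB" note for the default value.

```python
# Assumption: decimal gigabytes, consistent with 50000000000 bytes being "~50GB".
BYTES_PER_GB = 10**9


def disk_check_threshold_bytes(gc_free_space_limit_bytes, min_usable_space_gb):
    """Return the failure threshold in bytes, or None if the check is disabled."""
    if min_usable_space_gb == 0:
        # minUsableSpaceInGB at its default of 0 disables the check entirely.
        return None
    # The threshold is the smaller of the two options, compared in bytes.
    return min(gc_free_space_limit_bytes, min_usable_space_gb * BYTES_PER_GB)


def passes_disk_check(available_bytes,
                      gc_free_space_limit_bytes=50_000_000_000,
                      min_usable_space_gb=0):
    """Return True if the executor passes the disk space health check."""
    threshold = disk_check_threshold_bytes(gc_free_space_limit_bytes,
                                           min_usable_space_gb)
    if threshold is None:
        return True
    # An executor fails only when available space is lower than the threshold.
    return available_bytes >= threshold
```

For example, with minUsableSpaceInGB set to 20 and the other option left at its default, the effective threshold in this sketch is 20000000000 bytes, because that is the smaller of the two values.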




Health check failures

If an executor fails a health check, the following process occurs.

  1. On the next Dispatcher tick, the Dispatcher stops scheduling Runs on the executor.
  2. The executor goes idle once any existing Runs finish.
  3. After 5 minutes (configurable) of being idle, the executor enters an Unusable state, visible in its Executor State column on the Dispatcher interface.
  4. After 15 minutes (configurable) in Unusable state, the executor is stopped.
  5. 48 hours (configurable) after stopping, executors in an Unusable state are terminated.
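
The timeline above can be sketched as a function of how long the executor has been idle since it failed a health check. This is an illustrative model only: the function name and state labels are hypothetical, the durations are the defaults quoted in the steps, and Dispatcher tick granularity is ignored.

```python
from datetime import timedelta


def state_after_failed_check(idle_for,
                             idle_to_unusable=timedelta(minutes=5),
                             unusable_to_stopped=timedelta(minutes=15),
                             stopped_to_terminated=timedelta(hours=48)):
    """Return a label for the executor's state after `idle_for` of idle time.

    Assumes the executor failed a health check, finished its Runs, and was
    never put into Maintenance Mode.
    """
    stopped_at = idle_to_unusable + unusable_to_stopped
    terminated_at = stopped_at + stopped_to_terminated
    if idle_for < idle_to_unusable:
        return "idle, not scheduled"
    if idle_for < stopped_at:
        return "unusable"
    if idle_for < terminated_at:
        return "unusable, stopped"
    return "terminated"
```

Under these defaults, an executor that has been idle for 10 minutes would be reported as Unusable, and one idle for 49 hours would already be terminated.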



Unresponsive executors

When previously functional executors become completely unresponsive, Domino executes the following process.

  1. After 15 minutes (configurable) of being unresponsive, the executor is placed in an Unusable state and stopped immediately.
  2. 48 hours (configurable) after stopping, the instance is terminated.



Executors dead on arrival

When Domino attempts to start a new executor that never becomes responsive, the following process occurs.

  1. After 15 minutes (configurable) of being unresponsive, the executor is placed in an Unusable state and stopped immediately.
  2. 48 hours (configurable) after stopping, the instance is terminated.



Maintenance Mode

From the Actions interface for an individual executor, Domino system administrators can enable Maintenance Mode on the executor. Maintenance Mode has the following effects.

  • Executors in Maintenance Mode will not be assigned new Runs by the Dispatcher.
  • Executors in Maintenance Mode will not be automatically terminated by Unusable state timeouts.
  • Executors in Maintenance Mode do not count against the total executor limits for their hardware tiers.
  • An executor that is responsive and has been in Maintenance Mode for 120 minutes (configurable) will be stopped.
  • An executor that is unresponsive and has been in Maintenance Mode for 15 minutes (configurable) will be stopped.
  • An executor that is passing its health checks while in Maintenance Mode will attempt to rejoin the pool of Available executors in its hardware tier when Maintenance Mode is toggled off.

Domino system administrators should consider putting executors in an Unusable state into Maintenance Mode if they believe the executor can be fixed and restored to healthy operation, or if they want to attempt to recover data from the executor and thus want to exempt it from automatic termination.
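
The two stop rules for executors in Maintenance Mode can be sketched as below. This is a simplified model with a hypothetical function name. It uses the default timeouts from the list above and assumes a responsive executor is idle for the whole period.

```python
from datetime import timedelta


def should_stop_in_maintenance(responsive, time_in_maintenance,
                               responsive_timeout=timedelta(minutes=120),
                               unresponsive_timeout=timedelta(minutes=15)):
    """Return True if an executor in Maintenance Mode is due to be stopped."""
    # Responsive executors get the longer timeout; unresponsive ones the shorter.
    timeout = responsive_timeout if responsive else unresponsive_timeout
    return time_in_maintenance >= timeout
```

In this sketch, a responsive executor that has spent 30 minutes in Maintenance Mode is left running, while an unresponsive one is due to be stopped after the same 30 minutes.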




Configurable timeout settings

Namespace: common
Key: com.cerebro.domino.executor.maxIdleMaintenanceModeTimeInMinutes
Value: number of minutes
Default: 120

This is the time a machine can be responsive, running, and idle in
Maintenance Mode before it will be automatically stopped.

Namespace: common
Key: com.cerebro.domino.dispatcher.clusterHealthMonitoring.unhealthyExecutorMMTimeout
Value: JODA duration
Default: 15m

This is the duration before an unresponsive executor in Maintenance
Mode will be stopped, and the duration before an unresponsive
executor will be set to an Unusable state.

Namespace: common
Key: com.cerebro.domino.executor.healthCheckTimeoutInMinutes
Value: number of minutes
Default: 5

This is the duration before an Available executor that is failing
health checks will be put into an Unusable state.

Namespace: common
Key: com.cerebro.domino.dispatcher.clusterHealthMonitoring.unusable2StoppedExecutorTimeoutMin
Value: number of minutes
Default: 5

This is the duration a machine in an Unusable state can remain idle
before being stopped.

Namespace: common
Key: com.cerebro.domino.dispatcher.clusterHealthMonitoring.unusable2TerminatedExecutorTimeoutMin
Value: number of minutes
Default: 2880

This is the duration a machine in an Unusable state can remain
stopped before being terminated.

These options can be set in the Central Config.
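
Because these settings mix plain minute counts with a duration string, it can help to normalize them before comparing values. The sketch below collects the defaults listed above and converts each to a Python timedelta. The helper name is hypothetical, and the parser only handles the two value shapes shown in this document (a plain number of minutes, or a number followed by "m"). It is not a general parser for Joda durations.

```python
from datetime import timedelta

# Defaults copied from the settings listed above (namespace: common).
DEFAULT_TIMEOUTS = {
    "com.cerebro.domino.executor.maxIdleMaintenanceModeTimeInMinutes": "120",
    "com.cerebro.domino.dispatcher.clusterHealthMonitoring.unhealthyExecutorMMTimeout": "15m",
    "com.cerebro.domino.executor.healthCheckTimeoutInMinutes": "5",
    "com.cerebro.domino.dispatcher.clusterHealthMonitoring.unusable2StoppedExecutorTimeoutMin": "5",
    "com.cerebro.domino.dispatcher.clusterHealthMonitoring.unusable2TerminatedExecutorTimeoutMin": "2880",
}


def as_timedelta(value):
    """Convert a minutes value such as "120" or "15m" to a timedelta.

    Assumption: only these two shapes are supported in this sketch.
    """
    text = value.strip()
    if text.endswith("m"):
        text = text[:-1]
    return timedelta(minutes=int(text))


def normalized_defaults():
    """Return the defaults keyed by the last segment of each setting name."""
    return {key.rsplit(".", 1)[-1]: as_timedelta(value)
            for key, value in DEFAULT_TIMEOUTS.items()}
```

Converting the default of 2880 minutes this way yields a duration of two days, which matches the 48 hours quoted for termination earlier in this document.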