Users in Domino assign their Runs to Domino Hardware Tiers. A hardware tier defines the type of machine a job will run on, and the resource requests and limits for the pod that the Run will execute in. When configuring a hardware tier, you will specify the machine type by providing a Kubernetes node label.
You must create a Kubernetes node label for each type of node you want available for compute workloads in Domino, and apply it consistently to all compute nodes that meet that specification. Nodes that share a label form a node pool, and they are made available to Runs assigned to any Hardware Tier that points to that label.
Which pool a Hardware Tier uses is determined by the value in the Node Pool field of the Hardware Tier editor. In the screenshot below, the large-k8s Hardware Tier is configured to use the default node pool.
The diagram below shows a cluster configured with two node pools for Domino, one named default and one named default-gpu. You can make additional node pools available to Domino by labeling them with the same scheme: dominodatalab.com/node-pool=<node-pool-name>. The arrows in this diagram represent Domino requesting that a node with a given label be assigned to a Run. Kubernetes then assigns the Run to a node in the specified pool that has sufficient resources.
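The label-and-capacity matching described above can be sketched as a short simulation. This is illustrative only, not Domino or Kubernetes code; the node names and resource figures are hypothetical.

```python
# Illustrative sketch of label-based pool scheduling; not actual
# Domino or Kubernetes code. Node names and sizes are hypothetical.

POOL_LABEL = "dominodatalab.com/node-pool"

nodes = [
    {"name": "node-a", "labels": {POOL_LABEL: "default"},     "free_cpu": 4, "free_mem_gb": 16},
    {"name": "node-b", "labels": {POOL_LABEL: "default"},     "free_cpu": 1, "free_mem_gb": 2},
    {"name": "node-c", "labels": {POOL_LABEL: "default-gpu"}, "free_cpu": 8, "free_mem_gb": 64},
]

def schedule(run_pool, cpu_request, mem_request_gb):
    """Pick the first node in the requested pool with enough free resources."""
    for node in nodes:
        if node["labels"].get(POOL_LABEL) != run_pool:
            continue  # node belongs to a different pool
        if node["free_cpu"] >= cpu_request and node["free_mem_gb"] >= mem_request_gb:
            return node["name"]
    return None  # no capacity: queue the Run (or scale up, in cloud environments)

print(schedule("default", cpu_request=2, mem_request_gb=8))      # node-a
print(schedule("default-gpu", cpu_request=2, mem_request_gb=8))  # node-c
```

Note that the pool label only narrows the set of candidate nodes; whether a particular node in the pool is chosen still depends on its free capacity.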
By default, Domino creates a node pool with the label dominodatalab.com/node-pool=default, and all compute nodes Domino creates in cloud environments are assumed to be in this pool. In cloud environments with automatic node scaling, you will configure scaling components like AWS Auto Scaling Groups or Azure Scale Sets with these labels to create elastic node pools.
Every Run in Domino is hosted in a Kubernetes pod on a type of node specified by the selected Hardware Tier.
The pod hosting a Domino Run contains three containers:
The main Run container where user code is executed
An NGINX container for handling web UI requests
An executor support container which manages various aspects of the lifecycle of a Domino execution, like transferring files or syncing changes back to the Domino file system
The amount of compute power required for your Domino cluster will fluctuate over time as users start and stop Runs. Domino relies on Kubernetes to find space for each execution on existing compute resources. In cloud autoscaling environments, if there’s not enough CPU or memory to satisfy a given execution request, the Kubernetes cluster autoscaler will start new compute nodes to fulfill that increased demand. In environments with static nodes, or in cloud environments where you have reached the autoscaling limit, the execution request will be queued until resources are available.
Autoscaling Kubernetes clusters will shut nodes down when they are idle for more than a configurable duration. This reduces your costs by ensuring that nodes are used efficiently, and terminated when not needed.
Cloud autoscaling resources have properties like the minimum and maximum number of nodes they can create. You should set the node maximum to whatever you are comfortable with given the size of your team and expected volume of workloads. All else equal, it is better to have a higher limit than a lower one, as nodes are cheap to start up and shut down, while your users' time is very valuable. If the cluster cannot scale up any further, your users' executions will wait in a queue until the cluster can service their request.
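The scale-up-or-queue behavior described in the last few paragraphs can be sketched as a simple decision function. This is a hypothetical model of the cluster autoscaler's behavior, not its actual implementation; the numbers are illustrative.

```python
# Hypothetical sketch of the scale-up-or-queue decision in a cloud
# autoscaling environment; values and names are illustrative only.

def place_execution(free_cpu_per_node, current_nodes, max_nodes, cpu_request):
    """Return 'schedule', 'scale_up', or 'queue' for a new execution."""
    if any(free >= cpu_request for free in free_cpu_per_node):
        return "schedule"   # an existing node has room
    if current_nodes < max_nodes:
        return "scale_up"   # the autoscaler starts a new compute node
    return "queue"          # at the scaling limit: wait for resources to free up

print(place_execution([1, 2], current_nodes=2, max_nodes=5, cpu_request=4))  # scale_up
print(place_execution([1, 2], current_nodes=5, max_nodes=5, cpu_request=4))  # queue
```

Raising the node maximum only changes the second branch: requests that would otherwise queue instead trigger a scale-up.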
The amount of resources Domino will request for a Run is determined by the selected Hardware Tier for the Run. Each Hardware Tier has five configurable properties that determine the resource requests and limits for Run pods.
Cores: The number of requested CPUs.
Cores limit: The maximum number of CPUs. Domino recommends that this be the same as the request.
Memory: The amount of requested memory.
Memory limit: The maximum amount of memory. Domino recommends that this be the same as the request.
Number of GPUs: The number of GPU cards available.
The request values, Cores and Memory, as well as Number of GPUs, are thresholds used to determine whether a node has capacity to host the pod. These requested resources are effectively reserved for the pod. The limit values control the amount of resources a pod can use above and beyond the amount requested. If there’s additional headroom on the node, the pod can use resources up to this limit.
However, if resources are in contention and a pod using more than it requested is causing excess demand on a node, Kubernetes might evict the offending pod from the node, terminating the associated Domino Run. For this reason, Domino strongly recommends setting the requests and limits to the same values.
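The request/limit semantics above can be summarized in a small model. This is an illustrative sketch of the behavior, not Kubernetes source; the resource figures are hypothetical.

```python
# Illustrative model of request/limit semantics; not Kubernetes source.
# The request reserves capacity; usage above the request (up to the limit)
# is only safe while the node has spare headroom.

def pod_outcome(usage, request, limit, node_headroom):
    """Classify a pod's resource usage relative to its request and limit."""
    if usage > limit:
        return "killed"                # hard limit exceeded
    if usage <= request:
        return "running"               # within reserved resources, always safe
    if node_headroom >= usage - request:
        return "running (bursting)"    # borrowing idle capacity on the node
    return "eviction candidate"        # over its request on a contended node

# With request == limit (Domino's recommendation) the pod can never
# burst beyond its reservation, so it is never an eviction candidate.
print(pod_outcome(usage=6, request=4, limit=8, node_headroom=1))  # eviction candidate
print(pod_outcome(usage=6, request=8, limit=8, node_headroom=0))  # running
```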
To prevent a single user from monopolizing a Domino deployment, an administrator can set a limit on the number of executions a user can have running concurrently. Once that limit is reached for a given user, any additional executions are queued. This includes executions for Domino workspaces, jobs, and web applications, as well as any executions that make up an on-demand distributed compute cluster. For example, an on-demand Spark cluster consumes one execution slot for each Spark executor and one for the master.
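The per-user quota behavior can be sketched as follows. The quota value matches the documented default of 25, but the user and job names are hypothetical, and this is not Domino's internal implementation.

```python
# Hypothetical sketch of a per-user concurrent-execution quota; the
# user and job names are illustrative, not Domino internals.

from collections import defaultdict

MAX_CONCURRENT = 25          # documented default per-user execution limit

running = defaultdict(int)   # user -> count of currently running executions
queued = defaultdict(list)   # user -> executions waiting for a free slot

def submit(user, execution):
    """Start the execution if the user is under quota, otherwise queue it.

    Every pod counts toward the quota: workspaces, jobs, apps, and each
    member of an on-demand cluster (e.g. one slot per Spark executor
    plus one for the master).
    """
    if running[user] < MAX_CONCURRENT:
        running[user] += 1
        return "running"
    queued[user].append(execution)
    return "queued"

statuses = [submit("alice", f"job-{i}") for i in range(26)]
print(statuses[-1])          # the 26th execution is queued
print(len(queued["alice"]))  # 1
```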
See important settings for details.
From the menu bar in the admin application, click Infrastructure. You can see both Platform and Compute nodes in this interface.
Click the name of a node to get a complete description, including all applied labels, available resources, and currently hosted pods. This is the full kubectl describe output for the node.
Non-Platform nodes with a value in the Node Pool column are compute nodes that can be used for Domino Runs by configuring a Hardware Tier to use the pool.
From the menu bar in the admin application, click Executions. The page lists active Domino execution pods and shows the type of workload, the Hardware Tier used, the originating user and project, and the status for each pod. There are also links to view the full kubectl describe output for the pod and the node, and an option to download the deployment lifecycle log for the pod generated by Kubernetes and the Domino application.
From the menu bar in the admin application, click Advanced > Hardware Tiers.
On the Hardware Tiers page, click New to create a new Hardware Tier or Edit to modify an existing Hardware Tier.
Your Hardware Tier’s CPU, memory, and GPU requests must not exceed the available resources of the machines in the target node pool after accounting for overhead. If you need more resources than are available on existing nodes, you might have to add a new node pool with different specifications. This might mean adding individual nodes to a static cluster, or configuring new auto-scaling components that provision new nodes with the required specifications and labels.
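A quick capacity check like the following can help when sizing a new Hardware Tier against a pool's node type. The overhead figures below are hypothetical placeholders for system daemons and Domino's support containers; consult your cluster's actual allocatable resources.

```python
# Hypothetical capacity check when sizing a Hardware Tier; the node
# sizes and overhead figures are illustrative, not Domino defaults.

def tier_fits_node(tier_cpu, tier_mem_gb, node_cpu, node_mem_gb,
                   overhead_cpu=1.0, overhead_mem_gb=2.0):
    """True if the tier's requests fit on the node after reserving
    overhead for system daemons and supporting containers."""
    return (tier_cpu <= node_cpu - overhead_cpu
            and tier_mem_gb <= node_mem_gb - overhead_mem_gb)

# A 15-CPU / 60 GB tier fits a 16-CPU / 64 GB node; a 16-CPU tier does not.
print(tier_fits_node(15, 60, node_cpu=16, node_mem_gb=64))  # True
print(tier_fits_node(16, 60, node_cpu=16, node_mem_gb=64))  # False
```

If the check fails for every existing node type, that is the signal to add a node pool with larger machines rather than to raise the tier's limits.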
The following settings in the common namespace of the Domino central configuration affect compute grid behavior.
Value: Maximum number of executions each user may have running concurrently. If a user tries to start more than this, the excess executions are queued until existing executions finish. Default is 25.
Value: Number of seconds an execution pod that cannot be assigned due to execution quota limitations will wait for resources to become available before timing out. Default is 24 * 60 * 60 (24 hours).
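The timeout setting above can be illustrated with a small helper. The function below is hypothetical; only the 24-hour default value comes from the documentation.

```python
# Sketch of the quota-wait timeout behavior; the helper is hypothetical
# and only mirrors the documented default of 24 hours.

QUOTA_WAIT_TIMEOUT_SECONDS = 24 * 60 * 60  # default: 24 hours

def should_time_out(queued_at, now, timeout=QUOTA_WAIT_TIMEOUT_SECONDS):
    """True once an execution has waited on the quota longer than the timeout."""
    return now - queued_at > timeout

print(should_time_out(queued_at=0, now=3600))          # False: one hour in queue
print(should_time_out(queued_at=0, now=25 * 60 * 60))  # True: past 24 hours
```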