= Hardware Tier Best Practices

Domino hardware tiers define Kubernetes resource requests and limits and link them to specific node pools. Domino recommends the following best practices:

* Account for overhead
* Isolate workloads and users using node pools
* Isolate compute cluster workloads
* Set resource requests and limits to the same values

== Account for overhead

When designing hardware tiers, consider which resources will actually be available on a given node when Domino submits your workload for execution. Not all of a node's physical memory and CPU cores are available, due to system overhead.

Consider the following overhead components:

. Kubernetes management overhead
. Domino daemon-set overhead
. Domino execution sidecar overhead

=== Kubernetes management overhead

Kubernetes typically reserves a portion of each node’s capacity for daemons and pods that are required for Kubernetes itself. The amount of reserved resources usually scales with the size of the node, and also depends on the Kubernetes provider or distribution.

See the following for information about the resources reserved by managed Kubernetes offerings:

* https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh#L154[AWS EKS^]
* https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations[Azure AKS^]
* https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#node_allocatable[Google GKE^]

The best way to understand the available resources for your instance is to check one of your compute nodes with the `kubectl describe nodes`
command and then look for the `Allocatable` section of the output.
It will show the memory and CPU available for Domino.
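
For example (the node name is a placeholder; list your nodes first to find it):

[source,shell]
----
# List the compute nodes in the cluster
kubectl get nodes

# Show only the Allocatable section for one node
kubectl describe node <node-name> | grep -A 7 "Allocatable"
----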

=== Domino daemon-set overhead

Domino runs a set of management pods that reside on each of the compute nodes.
These are used for things like log aggregation, monitoring, and environment image caching.

The overhead of these daemon-sets is roughly 0.5 CPU cores and 0.5 GiB of RAM.
This overhead is taken from the allocatable resources on the node.

=== Domino execution overhead

Lastly, for each Domino execution, a set of supporting containers in the execution pod manages authentication, handles request routing, loads files, and installs dependencies.
These supporting containers make CPU and memory requests that Kubernetes takes into account when scheduling workspace, job, and app pods.

The supporting container overhead is currently roughly 1 CPU core and
1.5 GiB of RAM. This is configurable and might vary for your
deployment.

=== When should I account for overhead?

Overhead is relevant if you want to define a hardware tier dedicated to one execution at a time per node, such as for a node with a single physical GPU.
It is also relevant if you absolutely must maximize node density.

=== Example

Consider an `m5.2xlarge` EC2 node with raw capacity of 8 CPU cores and
32 GiB of RAM.

When used as part of an EKS cluster, the node reports an allocatable
capacity of roughly 27 GiB of RAM and 7910m of CPU:

[source,yaml]
----
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         8
  ephemeral-storage:           104845292Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      32120476Ki
  pods:                        58
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         7910m
  ephemeral-storage:           95551679124
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      28372636Ki
  pods:                        58
----

From the allocatable capacity, subtract 500m CPU and 0.5 GiB of
RAM for the Domino and EKS daemons.

Lastly, for a single execution, add 1000m CPU and 1.5 GiB of RAM for
sidecars, and you are left with roughly 6410m CPU and 25 GiB of RAM that you
can use for a single large hardware tier.

If you want to partition the node into smaller hardware tiers, you must account for the sidecar overhead for every execution that you want to colocate.

As a general rule, larger nodes allow for more flexibility as Kubernetes will take care of efficiently packing your executions onto the available capacity.
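
The arithmetic above can be sketched as a quick back-of-the-envelope calculation. This is an illustration using the `m5.2xlarge` figures from this page and a hypothetical 2-core / 8 GiB hardware tier; substitute the `Allocatable` values reported by your own nodes.

[source,shell]
----
# Allocatable capacity reported by the example m5.2xlarge node
allocatable_cpu_m=7910       # millicores
allocatable_mem_mi=27707     # 28372636 Ki / 1024

# Overhead figures from this page (configurable; may vary per deployment)
daemonset_cpu_m=500          # Domino/EKS daemon-sets (~0.5 core)
daemonset_mem_mi=512         # ~0.5 GiB
sidecar_cpu_m=1000           # per-execution sidecars (~1 core)
sidecar_mem_mi=1536          # ~1.5 GiB

# A hypothetical 2-core / 8 GiB hardware tier
tier_cpu_m=2000
tier_mem_mi=8192

# How many such executions fit on one node, per resource
cpu_fit=$(( (allocatable_cpu_m - daemonset_cpu_m) / (tier_cpu_m + sidecar_cpu_m) ))
mem_fit=$(( (allocatable_mem_mi - daemonset_mem_mi) / (tier_mem_mi + sidecar_mem_mi) ))

# The binding constraint is the smaller of the two
if [ "$cpu_fit" -lt "$mem_fit" ]; then echo "$cpu_fit"; else echo "$mem_fit"; fi
----

In this case both constraints allow two executions: after daemon-set overhead, each execution consumes 3000m CPU (2000m tier + 1000m sidecars), so the remaining 1410m CPU is not enough for a third.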

[[tr1]]
// As an admin, I can see which pods are running on a specific node by clicking on the node's name on the Admin Infrastructure page.
To see which pods are running on a specific node, go to the
*Infrastructure* admin page and click the name of the node.
In the following image, there is a box around the execution pods.
The other pods handle logging, caching, and other services.

link:/images/4.x/admin_guide/pod-info.png[image:/images/4.x/admin_guide/pod-info.png[image]]

== Isolate workloads and users using node pools

Node pools are defined by labels added to nodes in a specific format:
`dominodatalab.com/node-pool=<your-node-pool>`.
In the hardware tier form, specify only the pool name (`your-node-pool`).
You can name a node pool anything you like, but Domino recommends naming them something meaningful given the intended use.

[[tr2]]
// As an admin, I can create a hardware tier that uses the Domino "default" node pool.
[[tr3]]
// As an admin, I can create a hardware tier that uses the Domino "default-gpu" node pool.
Domino typically comes pre-configured with `default` and `default-gpu`
node pools, with the assumption that most user executions will run on nodes in one of those pools.
As your compute needs become more sophisticated, you might want to keep certain users separate from one another or provide specialized hardware to certain groups of users.

For example, if a data science team in New York City needs a specific GPU machine that other teams don't, you can apply the following label to the appropriate nodes:
`dominodatalab.com/node-pool=nyc-ds-gpu`.
In the hardware tier form, you would specify `nyc-ds-gpu`.
To ensure only that team has access to those machines, create a `NYC` organization, add the correct users to the organization, and give that organization access to the new hardware tier that uses the `nyc-ds-gpu` node pool label.
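
The labeling itself can be done with `kubectl` (the node name is a placeholder):

[source,shell]
----
# Add a GPU node to the nyc-ds-gpu node pool
kubectl label node <node-name> dominodatalab.com/node-pool=nyc-ds-gpu

# Confirm which nodes now belong to the pool
kubectl get nodes -l dominodatalab.com/node-pool=nyc-ds-gpu
----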


== Isolate compute cluster workloads

Domino on-demand compute clusters often require pooling a large amount of compute resources on specialized hardware (for example, using larger nodes compared to your other workloads).

[[tr4]]
// As an admin, I can restrict a hardware tier to a specific type (or multiple types) of compute cluster.
Consider a use case where you want a set of extra large nodes to be available for on-demand Ray workloads but not available for regular workloads.
Go to *Advanced > Hardware Tiers* and create or edit a hardware tier. Then, go to *Restrict to compute cluster* and select *Ray*.

For clusters composed of multiple workload types (for example, Ray head and worker nodes), Domino recommends that you create a separate, dedicated hardware tier that matches the requirements of each type.
In the Ray example, you can create `ray-head` and `ray-worker` hardware tiers.


== Set resource requests and limits to the same values

With Kubernetes, resource limits must be greater than or equal to resource requests. So if
your memory request is 16 GiB, your limit must be greater than or equal to 16 GiB.
While setting a limit higher than the request can be useful to allow
bursts of CPU or memory, it is also dangerous:
Kubernetes might evict a pod that uses more resources than it initially
requested. For Domino workspaces or jobs, eviction terminates the
execution.

For this reason, Domino recommends setting memory and CPU requests equal to limits.
In this case, Python and R cannot allocate more memory than the limit, and execution pods will not be evicted.
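
For illustration, a hardware tier sized at 2 cores and 8 GiB of memory translates into a pod resources stanza like the following (the values are hypothetical, and sidecar overhead is requested separately):

[source,yaml]
----
resources:
  requests:
    cpu: 2000m
    memory: 8Gi
  limits:
    cpu: 2000m      # equal to the request, so no CPU bursting
    memory: 8Gi     # equal to the request, so the pod is not evicted for memory overuse
----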

On the other hand, if the limit is higher than the request, one user's execution can consume resources that another user's execution pod needs.
This is the "noisy neighbor" problem that you might have experienced in other multi-user environments.
But instead of allowing the noisy neighbor to degrade performance for other pods on the node, Kubernetes evicts the offending pod when necessary to free up resources.

User data on disk will not be lost, because Domino stores user data on a persistent volume that can be reused.
But anything in memory will be lost and the execution will have to be restarted.
Copyright © 2022 Domino Data Lab. All rights reserved.