AWS Trainium and Inferentia silicon accelerators

Domino supports the use of cost- and energy-efficient AWS-designed silicon accelerators, AWS Trainium and Inferentia, to accelerate deep-learning model training and AI inference workloads. Use the AWS Neuron SDK to reuse existing code. Learn how to set up Trainium and Inferentia accelerators in your Domino deployment.

Setup involves the following:

  1. Node group creation: Create a new node group for Trainium and Inferentia instances.

  2. Device plugin configuration: Provide hardware-specific settings.

  3. Hardware tier setup: Enable Domino users to use Trainium and Inferentia instances for their workloads.

  4. Environment configuration: Set up the necessary development tools and software libraries.

Node group creation

To use AWS accelerators, create a new node group that uses one of the following Neuron-based instance types:

Name            vCPU   Memory (GiB)   aws.amazon.com/neuron   Total Neuron Memory (GiB)
inf1.xlarge     4      8              1                       8
inf1.2xlarge    8      16             1                       8
inf1.6xlarge    24     48             4                       32
inf1.24xlarge   96     192            16                      128
inf2.xlarge     4      16             1                       32
inf2.8xlarge    32     128            1                       32
inf2.24xlarge   96     384            6                       192
inf2.48xlarge   192    768            12                      384
trn1.2xlarge    8      32             1                       32
trn1.32xlarge   128    512            16                      512

For the cluster autoscaler to scale your Neuron-based node group successfully, you must tag its Auto Scaling groups with the Neuron device resource template tag, k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron, as shown in the example node group tags below:

(Every tag below is applied with "Tag new instances" enabled.)

Key                                                                         Value
Name                                                                        inferentia-test-domino-trn1-Node
alpha.eksctl.io/cluster-name                                                inferentia-test
alpha.eksctl.io/eksctl-version                                              0.155.0
alpha.eksctl.io/nodegroup-name                                              domino-trn1
alpha.eksctl.io/nodegroup-type                                              unmanaged
aws:cloudformation:logical-id                                               NodeGroup
aws:cloudformation:stack-id                                                 arn:aws:cloudformation:us-west-2:873872646799:stack/eksctl-inferentia-test-n…
aws:cloudformation:stack-name                                               eksctl-inferentia-test-nodegroup-domino-trn1
eksctl.cluster.k8s.io/v1alpha1/cluster-name                                 inferentia-test
eksctl.io/v1alpha2/nodegroup-name                                           domino-trn1
k8s.io/cluster-autoscaler/enabled                                           true
k8s.io/cluster-autoscaler/inferentia-test                                   owned
k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool   trainium
k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron     1
kubernetes.io/cluster/inferentia-test                                       owned

Example eksctl node group config

Here’s an example eksctl node group config for Neuron-based node groups:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: inferentia-test
  region: us-west-2

nodeGroups:
  - name: domino-trn1
    instanceType: trn1.2xlarge
    minSize: 0
    maxSize: 3
    desiredCapacity: 1
    volumeSize: 200
    volumeType: gp3
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "trainium"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool: trainium
      k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron: 1
    iam:
      withAddonPolicies:
        ebs: true
        efs: true
  - name: domino-inf1
    instanceType: inf1.2xlarge
    minSize: 0
    maxSize: 3
    desiredCapacity: 1
    volumeSize: 200
    volumeType: gp3
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "inferentia"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool: inferentia
      k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron: 1
    iam:
      withAddonPolicies:
        ebs: true
        efs: true
  - name: domino-inf2
    instanceType: inf2.xlarge
    minSize: 0
    maxSize: 3
    desiredCapacity: 1
    volumeSize: 200
    volumeType: gp3
    availabilityZones: ["us-west-2a"]
    labels:
      "dominodatalab.com/node-pool": "inferentia2"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/dominodatalab.com/node-pool: inferentia2
      k8s.io/cluster-autoscaler/node-template/resources/aws.amazon.com/neuron: 1
    iam:
      withAddonPolicies:
        ebs: true
        efs: true
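
If you manage your cluster with eksctl, you can apply a config like this to an existing cluster with, for example, eksctl create nodegroup --config-file=<your-config>.yaml.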

Device plugin deployment

Once your nodes have joined the cluster, deploy the Neuron device plugin DaemonSet using the following specification. You must use version 2.17.3.0 or later for the device plugin to process Domino workloads correctly.

To deploy this DaemonSet:

  1. Save the following specification to a file (such as neuron-device-plugin-ds.yaml).

  2. Apply the specification with kubectl apply -f neuron-device-plugin-ds.yaml.

    ---
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: neuron-device-plugin
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
      - patch
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - update
      - patch
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - nodes/status
      verbs:
      - patch
      - update
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: neuron-device-plugin
      namespace: kube-system
    ---
    kind: ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: neuron-device-plugin
      namespace: kube-system
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: neuron-device-plugin
    subjects:
    - kind: ServiceAccount
      name: neuron-device-plugin
      namespace: kube-system
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: neuron-device-plugin-daemonset
      namespace: kube-system
    spec:
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          name: neuron-device-plugin-ds
      template:
        metadata:
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ""
          labels:
            name: neuron-device-plugin-ds
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                    - inf1.xlarge
                    - inf1.2xlarge
                    - inf1.6xlarge
                    - inf1.24xlarge
                    - inf2.xlarge
                    - inf2.8xlarge
                    - inf2.24xlarge
                    - inf2.48xlarge
                    - trn1.2xlarge
                    - trn1.32xlarge
          containers:
          - env:
            - name: KUBECONFIG
              value: /etc/kubernetes/kubelet.conf
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            image: public.ecr.aws/neuron/neuron-device-plugin:2.17.3.0
            imagePullPolicy: Always
            name: k8s-neuron-device-plugin-ctr
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                drop:
                - ALL
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /var/lib/kubelet/device-plugins
              name: device-plugin
            - mountPath: /run
              name: infa-map
          dnsPolicy: ClusterFirst
          priorityClassName: system-node-critical
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: neuron-device-plugin
          serviceAccountName: neuron-device-plugin
          terminationGracePeriodSeconds: 30
          tolerations:
          - key: CriticalAddonsOnly
            operator: Exists
          - effect: NoSchedule
            key: aws.amazon.com/neuron
            operator: Exists
          volumes:
          - hostPath:
              path: /var/lib/kubelet/device-plugins
              type: ""
            name: device-plugin
          - hostPath:
              path: /run
              type: ""
            name: infa-map
      updateStrategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 1
        type: RollingUpdate
  3. Once the device plugin DaemonSet is deployed, run kubectl describe node to confirm that you see device plugin daemons running on your Neuron-based instances, and that they advertise aws.amazon.com/neuron resources to Kubernetes.

The following output is an example of a correctly configured Neuron-based node. Note the Neuron device plugin daemon present on the node, the advertised aws.amazon.com/neuron resource, and the Domino node pool label identifying the node as Trainium.

Name:               ip-192-168-42-179.us-west-2.compute.internal
Roles:              <none>
Labels:             alpha.eksctl.io/cluster-name=inferentia-test
                    alpha.eksctl.io/instance-id=i-00549360c9911f4f1
                    alpha.eksctl.io/nodegroup-name=domino-trn1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=trn1.2xlarge
                    beta.kubernetes.io/os=linux
                    dominodatalab.com/node-pool=trainium
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2b
                    k8s.io/cloud-provider-aws=7c4bfb478ecbb2400bead13fc878a3a1
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-192-168-42-179.us-west-2.compute.internal
                    kubernetes.io/os=linux
                    node-lifecycle=on-demand
                    node.kubernetes.io/instance-type=trn1.2xlarge
                    topology.ebs.csi.aws.com/zone=us-west-2b
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2b

... <snip> ...

Capacity:
  aws.amazon.com/neuron:        1
  aws.amazon.com/neuroncore:    2
  aws.amazon.com/neurondevice:  1
  cpu:                          8
  ephemeral-storage:            209702892Ki
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       32338380Ki
  pods:                         58
  smarter-devices/fuse:         20
Allocatable:
  aws.amazon.com/neuron:        1
  aws.amazon.com/neuroncore:    2
  aws.amazon.com/neurondevice:  1
  cpu:                          7910m
  ephemeral-storage:            192188443124
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       31321548Ki
  pods:                         58
  smarter-devices/fuse:         20
System Info:
  Machine ID:                 ec240d0453aef36a07d5248a753946c5
  System UUID:                ec240d04-53ae-f36a-07d5-248a753946c5
  Boot ID:                    9d066b81-273b-406d-ad0f-6375adebca5d
  Kernel Version:             5.4.253-167.359.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.19
  Kubelet Version:            v1.26.7-eks-8ccc7ba
  Kube-Proxy Version:         v1.26.7-eks-8ccc7ba
ProviderID:                   aws:///us-west-2b/i-00549360c9911f4f1
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------   ---------------  -------------  ---
  domino-compute              run-650b32d5eb80df0d60cec514-kgcc4      5610m (70%)   6600m (83%)  28174Mi (92%)    28174Mi (92%)  3h17m
  domino-platform             aws-ebs-csi-driver-node-lwp9g           30m (0%)      200m (2%)    80Mi (0%)        456Mi (1%)     3h10m
  domino-platform             aws-efs-csi-driver-node-kk7mb           20m (0%)      200m (2%)    40Mi (0%)        200Mi (0%)     3h10m
  domino-platform             docker-registry-cert-mgr-k67bb          0 (0%)        0 (0%)       0 (0%)           0 (0%)         3h10m
  domino-platform             fluentd-ljkhk                           200m (2%)     1 (12%)      600Mi (1%)       2Gi (6%)       3h10m
  domino-platform             image-cache-agent-7ztv6                 0 (0%)        0 (0%)       0 (0%)           0 (0%)         3h10m
  domino-platform             prometheus-node-exporter-vr9bc          0 (0%)        0 (0%)       0 (0%)           0 (0%)         3h10m
  domino-platform             smarter-device-manager-ncsr6            10m (0%)      100m (1%)    15Mi (0%)        15Mi (0%)      3h10m
  kube-system                 aws-node-q89xl                          25m (0%)      0 (0%)       0 (0%)           0 (0%)         3h10m
  kube-system                 kube-proxy-b2qpv                        100m (1%)     0 (0%)       0 (0%)           0 (0%)         3h10m
  kube-system                 neuron-device-plugin-daemonset-h8d7k    0 (0%)        0 (0%)       0 (0%)           0 (0%)         3h10m

Hardware Tier setup

Next, make the node group accessible to your users by creating a Domino hardware tier that does the following:

  • Targets the node pool label you’ve given to your Neuron-based nodes

  • Requests a suitable amount of the node vCPU and memory, allowing for necessary overhead

  • Requests a custom GPU resource with the name aws.amazon.com/neuron

See the following example:

Key                               Value
Cluster Type                      Kubernetes
ID                                trainium
Name                              Trainium
Cores Requested                   5.0
Memory Requested (GiB)            26.0
Number of GPUs                    1
Use custom GPU name               Yes
GPU Resource Name                 aws.amazon.com/neuron
Cents Per Minute Per Run          0.0
Node Pool                         trainium
Restrict to compute cluster       Options: Spark, Ray, Dask, Mpi
Maximum Simultaneous Executions
Overprovisioning Pods             0
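
For reference, a hardware tier defined this way results in execution pods whose resource requests look roughly like the following sketch (illustrative only; the actual pod spec that Domino generates contains many more fields):

resources:
  requests:
    cpu: "5"                   # Cores Requested
    memory: 26Gi               # Memory Requested (GiB)
    aws.amazon.com/neuron: 1   # the custom GPU resource name
  limits:
    aws.amazon.com/neuron: 1   # Kubernetes extended resources must also appear as limits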

Environment setup

The AWS Neuron SDK is designed for use with fully integrated frameworks like PyTorch and TensorFlow. When setting up a Domino environment for a new version of Neuron or its integrated framework, read the AWS Neuron SDK setup documentation for that framework and version.

As an example, and to facilitate testing, here’s an environment definition for adding PyTorch Neuron to the Domino 5.7 Standard Environment (quay.io/domino/compute-environment-images:ubuntu20-py3.9-r4.3-domino5.7-standard):

# Configure Linux for Neuron repository updates
RUN sudo touch /etc/apt/sources.list.d/neuron.list
RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" | sudo tee -a /etc/apt/sources.list.d/neuron.list
RUN sudo wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -

# Update OS packages
RUN sudo apt-get update -y

# Install Neuron Driver
RUN sudo apt-get install aws-neuronx-dkms=2.* -y

# Install Neuron Runtime
RUN sudo apt-get install aws-neuronx-collectives=2.* -y
RUN sudo apt-get install aws-neuronx-runtime-lib=2.* -y

# Install Neuron Tools
RUN sudo apt-get install aws-neuronx-tools=2.* -y

# Add Neuron tools to the PATH (a RUN export would not persist across image layers)
ENV PATH=/opt/aws/neuron/bin:$PATH

# pip installs
RUN python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
RUN python -m pip install --user neuronx-cc==2.* torch-neuronx torchvision
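
Note that the torch-neuronx packages above target trn1 and inf2 instances; if your users run on first-generation inf1 instances, install the separate torch-neuron packages instead, following the corresponding Neuron documentation.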

Testing Neuron devices in Domino

To test your setup, start a Jupyter workspace using a Neuron-based hardware tier and Neuron-enabled Workspace Environment.


Once your workspace has started, open a Python notebook and execute a cell with the command !/opt/aws/neuron/bin/neuron-ls to see mounted Neuron devices.


You can now use the Neuron framework you’ve installed to invoke the mounted accelerator.
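
For example, with the PyTorch Neuron environment defined above, a notebook cell along these lines compiles a small model and runs it on the device. This is a minimal sketch that assumes a trn1 or inf2 instance and the torch-neuronx and torchvision packages from the environment definition; the model and input shape are purely illustrative:

import torch
import torch_neuronx
from torchvision import models

# Build a small example model and put it in inference mode
model = models.resnet18()
model.eval()

# Example input matching the model's expected shape
example = torch.rand(1, 3, 224, 224)

# Compile the model for the Neuron accelerator
neuron_model = torch_neuronx.trace(model, example)

# Run inference on the mounted Neuron device
output = neuron_model(example)
print(output.shape)

The trace call invokes the Neuron compiler ahead of time, so it can take a few minutes for larger models.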

Next steps

Refer to the Getting Started with Neuron guide for your chosen framework.