Deploy Domino endpoints

Domino makes it easy to deploy your model as an HTTP endpoint, so you can quickly start sending HTTP requests and getting predictions from your models. Learn how to:

  • Deploy a model for synchronous inference (real-time).

  • Deploy a model for asynchronous inference.

  • Access Domino assets in a Domino endpoint image.

  • Request predictions from a Domino endpoint.

  • Retrain a model.

  • See an example of an asynchronous inference application.

How deployment works

When you deploy a model as a Domino endpoint, Domino packages your project as a Flask application. By default, the endpoint image includes your compute environment, all project files, a Flask/Plumber harness that exposes the HTTP interface, and an authentication and load-balancing layer.

A Domino endpoint is an HTTP API wrapped around your inference code. You supply the arguments to your inference function as parameters in the HTTP request payload, and the endpoint's response contains the prediction. When a Domino endpoint is published, Domino runs the script that contains the function; while the process waits for input, any objects or functions defined in the script remain in memory. Each request sent to the endpoint invokes the function. Because the script is sourced only at publish time, expensive initialization runs once rather than on each request.
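For example, a minimal prediction script following this lifecycle might look like the sketch below; the module-level weights computation is a stand-in for real model loading:

```python
import math

# Runs once, when Domino sources the script at publish time: expensive
# initialization (for example, loading model weights from disk) belongs
# here, not inside the function body.
WEIGHTS = [math.sqrt(i) for i in range(3)]  # stand-in for a loaded model

def predict(x, y, z):
    # Runs on every HTTP request; WEIGHTS is already in memory.
    score = WEIGHTS[0] * x + WEIGHTS[1] * y + WEIGHTS[2] * z
    return {"prediction": score}
```

Each request to the endpoint calls `predict` with the supplied parameters, while `WEIGHTS` is computed only once at publish time.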

Synchronous vs asynchronous

Decide if your model needs to be made available as a synchronous or asynchronous Domino endpoint.

Synchronous inference

Deploy the model so it can receive an HTTP request, make a prediction, and return results with low latency. This is useful for interactive applications that need real-time predictions.

Asynchronous inference

Deploy the model so it can process and make predictions asynchronously for computationally intensive workloads. The asynchronous HTTP interface polls for results after a request is queued. After completion, the endpoint returns the result as a response to the polling request.

Domino limits the size of the request and response packets to 10 KB, so users are expected to send and receive the payloads by reference. After Domino picks up the processing request, it keeps the request alive for 30 minutes. Predictions are written to an external data store that you define.

Package requirements

To support the Flask harness that exposes a model written in Python, the following packages are required in your endpoint image:

  • uWSGI

  • Flask

  • Six

  • Prometheus-client

Domino will return an error if these packages are not found in your compute environment.

Additionally, asynchronous Domino endpoints require the seldon-core package.

When your model is registered with Domino’s Model Registry feature, the mlflow Python package will attempt to infer package requirements for your model and write them to a requirements.txt file. Domino will install these requirements during the build process. At the moment, no other requirements.txt files are installed when endpoint images are built.

All of the required packages are pre-installed in all the flavors of Domino’s Compute Environment images, except the minimal image, which does not have seldon-core or mlflow.
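As a sanity check before publishing, you can verify from a workspace that the required distributions are installed in your environment. The helper below is a hypothetical sketch using only the standard library:

```python
from importlib import metadata

# Distributions Domino requires in the endpoint image (per the list above);
# asynchronous endpoints additionally need seldon-core.
REQUIRED = ["uWSGI", "Flask", "six", "prometheus-client"]

def missing_packages(names):
    """Return the subset of `names` with no installed distribution."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Running `missing_packages(REQUIRED)` in a workspace on your chosen environment should return an empty list before you deploy.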

Deploy a model

Domino supports a few different methods for deploying a model as a Domino endpoint:

  • Deploy from the UI - For quick, one-time deployments.

  • Deploy with a scheduled job - Schedule a job and deploy a model from the job. Especially useful for keeping models trained on the latest data on a regular cadence.

  • Deploy with the API - Use the Domino API to schedule jobs from other applications.

Deploy from the UI

Use the Domino UI to deploy your model directly from your browser.

  1. In your project, go to Deployments > Endpoints > Create Domino Endpoint.

  2. Provide a name and description, and select the prediction code that executes when the model is called.

  3. (Optional) Select a custom compute environment to build the deployed model container.

  4. (Optional) Configure scaling: the number of instances and the compute resources attached to each instance. See Scale model deployments for more information.

    Note
    You can also include a GPU if your administrator configured GPUs in your node pool.
  5. Under Request Type, select Sync for real-time predictions or Async for long-running predictions.

Update from a scheduled job

You can use scheduled jobs to update an existing endpoint. This is especially useful for keeping models trained on the latest data. Select the Domino endpoint from the Update Domino Endpoint field when scheduling a job. This setting uses the state of the project’s files after the run to build and deploy a new version of an existing Domino endpoint.

Use scheduled deploy jobs with a training script that trains on fresh data to keep your Domino endpoint automatically up to date.

Deploy with the Domino API

Use Domino’s APIs to programmatically build and deploy models. For more information, see the API docs.

Set the isAsync parameter to true in the API call to create an asynchronous inference endpoint.
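A hedged sketch of such a call is shown below. The endpoint path, auth header, and field names (projectId, file, function, isAsync) are assumptions for illustration; verify them against the API docs before use:

```python
DOMINO_URL = "https://domino.mycompany.com"  # placeholder
API_KEY = "YOUR_API_KEY"                     # placeholder

def build_publish_payload(project_id, file, function, name, is_async=False):
    """Assemble an (assumed) request body for publishing a model;
    isAsync=True requests an asynchronous inference endpoint."""
    return {
        "projectId": project_id,
        "file": file,          # script containing the prediction function
        "function": function,  # name of the prediction function
        "name": name,
        "isAsync": is_async,
    }

def publish_model(payload):
    """Send the publish request (assumed path and header; see API docs)."""
    import requests  # imported here so the payload helper stays dependency-free

    response = requests.post(
        f"{DOMINO_URL}/v1/models",
        headers={"X-Domino-Api-Key": API_KEY},
        json=payload,
    )
    response.raise_for_status()
    return response.json()
```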

Deploy Domino Endpoints on a remote data plane

Users may prefer to host their endpoints in a separate cluster or data plane from the one where the model was built, for scalability or data-proximity reasons. Domino facilitates this through its Create Endpoint feature, allowing models to be packaged and deployed close to where they’ll have the most impact.

When deploying Model Endpoints to a remote data plane, select the hardware tier associated with the desired local or remote data plane from the Deployment Target list (previously resource quotas) in the New Model Endpoint form.

Domino containerizes the model (MLflow or Project) as a Domino endpoint image, deploys it as an endpoint, and exposes a URL local to that remote cluster where a Nexus data plane has been deployed. Users can invoke it via the exposed URL and data movement is restricted to this remote data plane only.

Limitations: Currently, Domino endpoints deployed to a remote data plane do not support asynchronous requests (described below) or integrated model monitoring.

Access Domino artifacts from endpoint images

You can access Domino artifacts from endpoint images, but each artifact type has its nuances. Learn how to use Domino artifacts from your endpoint image and keep them up to date.

Domino endpoint environments

When you deploy a Domino endpoint, select the compute environment to include in the endpoint image. The environment bundles the packages your inference script requires so they are available ahead of execution.

  • Domino endpoint hosts don’t read requirements.txt files or execute commands defined in the pre-setup, post-setup, pre-run, or post-run scripts of your environment. If your project uses requirements.txt or any setup scripts to install specific packages or repositories, add them to the Dockerfile instructions of your environment.

  • Your Domino endpoint doesn’t inherit environment variables set at the project level. However, you can set Domino endpoint-specific environment variables on the Domino endpoint settings page. This separation decouples the management of projects and deployed models. See Store Project credentials for more details.

  • Domino endpoints run using the uid and gid of 12574 (the domino user). A user with this uid and gid must exist in the selected environment for the Domino endpoint. See Create a Domino environment image for more details and instructions on how to use an existing image to create an environment.

  • Domino endpoint image builds run using the USER directive last specified in the Dockerfile for the selected compute environment. Please ensure that the environment image sets the desired user for the Domino endpoint image build (e.g. USER root if you want the Domino endpoint image build to run as root).

Domino endpoint project files

Your Domino endpoint can access files from the project. Domino loads the project files onto the Domino endpoint host, similar to a Run or Workspace executor host, with a few important differences:

  • Domino adds project files to the image when you build a Domino endpoint. Starting or stopping the Domino endpoint won’t change the files on the model host. To update the model host files, you need to create a new version.

  • Domino pulls Git repositories attached to projects when you build a Domino endpoint. Starting or stopping the Domino endpoint won’t change the files on the model host. To update the model host files, you need to create a new version.

  • The Domino endpoint host mounts project files at /mnt/<username>/<project_name>. This location differs from the default location for Runs or Workspaces, /mnt/. You can use the Domino environment variable DOMINO_WORKING_DIR to reference the directory where your project is mounted.

  • The Domino endpoint image excludes project files listed in the .modelignore file that are located in the project’s root directory. Excluded files are not mounted to the Domino endpoint host.
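For instance, a script that reads a project file can resolve the mount point through DOMINO_WORKING_DIR rather than hard-coding the path (the filename below is a placeholder):

```python
import json
import os

# Resolve the project root from DOMINO_WORKING_DIR instead of hard-coding
# /mnt/<username>/<project_name>; fall back to "/mnt" outside an endpoint.
PROJECT_DIR = os.environ.get("DOMINO_WORKING_DIR", "/mnt")

def load_project_json(filename):
    """Load a JSON file shipped with the project (filename is a placeholder)."""
    with open(os.path.join(PROJECT_DIR, filename)) as f:
        return json.load(f)
```

Because the same variable is set in Runs and Workspaces, this code works unchanged in both contexts.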

Add a Kubernetes volume to a synchronous Domino endpoint container

If your endpoint loads inference data from, or writes responses to, an external volume, add Kubernetes volumes:

  1. Select a Domino endpoint from the Endpoints page.

  2. Go to Settings > Advanced > Add Volume.

    Note
    Only hostPaths are supported.
  3. Enter the required values.

    1. Name - Kubernetes volume name.

    2. Mount Path - mount point in the Domino endpoint container.

    3. path - the path of the Kubernetes host node that must be mounted in the Domino endpoint container, as configured by your administrator.

    4. Read Only? - the read/write permission of the mounted volume.

See the Kubernetes documentation for more details.

Request predictions

After you deploy the Domino endpoint and its status changes to Running, try test inputs with the Tester tab in the Domino endpoint UI.

Tip
Use the Request window to make calls to the Domino endpoint from the Domino web application. You will find additional tabs with code samples to send requests to the Domino endpoint with other tools and in various programming languages.

JSON requests

Send your requests as JSON objects. Depending on how you wrote your prediction script, you need to format the JSON request as follows:

  • If you use named parameters in your function definition, use a data dictionary or parameter array in your JSON request. For example, for my_function(x, y, z), use {"data": {"x": 1, "y": 2, "z": 3}} or {"parameters": [1, 2, 3]}.

  • If your function takes a single dictionary argument, use a parameter array containing that dictionary. For example, if you define my_function(d) and access d["x"] and d["y"], send the request: {"parameters": [{"x": 1, "y": 2, "z": 3}]}.

  • In Python, you can also use kwargs to pass in a variable number of arguments. If you do this: my_function(x, **kwargs) and your function then uses kwargs["y"] and kwargs["z"], you can use a data dictionary to call your endpoint: {"data": {"x": 1, "y": 2, "z": 3}}.
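To make the mapping concrete, here are hypothetical prediction functions for each request shape, plus a simplified dispatcher that mimics how a JSON body selects the call style (an illustration, not Domino's actual harness):

```python
def predict_named(x, y, z):
    # Matches {"data": {"x": 1, "y": 2, "z": 3}} or {"parameters": [1, 2, 3]}
    return x + y + z

def predict_dict(d):
    # Matches {"parameters": [{"x": 1, "y": 2, "z": 3}]}
    return d["x"] + d["y"] + d["z"]

def predict_kwargs(x, **kwargs):
    # Matches {"data": {"x": 1, "y": 2, "z": 3}}
    return x + kwargs["y"] + kwargs["z"]

def dispatch(func, request):
    """Simplified sketch: a "data" dictionary becomes keyword arguments,
    a "parameters" array becomes positional arguments."""
    if "data" in request:
        return func(**request["data"])
    return func(*request["parameters"])
```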

Domino converts JSON data types to the following R and Python data types.

  JSON Type       Python Type   R Type
  dictionary      dictionary    named list
  array           list          list
  string          str           character
  number (int)    int           integer
  number (real)   float         numeric
  true            True          TRUE
  false           False         FALSE
  null            None          N/A

The endpoint returns the result object, which is a literal, array, or dictionary.

Synchronous requests

  • Request a prediction and retrieve the result: Pass the URL of the Domino endpoint, authorization token, and input parameters. The response object contains the status, response headers, and result.

    import requests

    response = requests.post(
        "{DOMINO_URL}/models/{MODEL_ID}/latest/model",
        auth=("{MODEL_ACCESS_TOKEN}", "{MODEL_ACCESS_TOKEN}"),
        json={"data": {"start": 1, "stop": 100}},
    )

    print(response.status_code)
    print(response.headers)
    print(response.json())

Asynchronous requests

Note
  • Domino cannot guarantee the processing order of predictions. The order in which predictions are handled can differ from the order of requests.

  • The same prediction request may execute more than once, so prediction functions must be idempotent (unaffected by repeated executions).

  • Use model environment variables for secrets.

  • Request a prediction: Pass the URL of the Domino endpoint, authorization token, and input parameters. For large payloads (>10 KB), pass a reference to the payload location. A prediction identifier is returned, which you use to poll for completion and retrieve the result.

    import requests

    create_response = requests.post(
        f"{DOMINO_URL}/api/modelApis/async/v1/{MODEL_ID}",
        headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
        json={"parameters": {"input_file": "s3://example/filename.ext"}},
    )
    prediction_id = create_response.json()["asyncPredictionId"]
  • Poll for completion / retrieve the results: Use the prediction_id from the request to retrieve the result. The response includes one of the following statuses: SUCCEEDED, FAILED, or QUEUED. The results field contains the output from the predict function for a successful completion.

    status_response = requests.get(
        f"{MODEL_BASE_URL}/{prediction_id}",
        headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
    )

    prediction_status = status_response.json()["status"]
    if prediction_status == SUCCEEDED_STATUS:  # a succeeded response includes the prediction result in "result"
        result = status_response.json()["result"]

Retrain a model

Deploy a new version

After you retrain your model with new data or switch to a different machine learning algorithm, publish a new version of the Domino endpoint. To follow best practices, stop a previous version of the Domino endpoint and then deploy a new version.

Asynchronous example

This example shows a Python client application that creates a prediction request from an asynchronous Domino endpoint, polls periodically for completion, and retrieves the result.

import json
import logging
import requests
import sys
import time

logging.basicConfig(stream=sys.stdout, level=logging.INFO)  # change logging setup as required

# TO EDIT: update the example request parameters for your model
REQUEST_PARAMETERS = {
    "param1": "value1",
    "param2": "value2",
    "param3": 3
}
# TO EDIT: copy these values from "Calling your Model" on the Domino endpoint overview page
DOMINO_URL = "https://domino.mycompany.com:443"
MODEL_ID = "5a4131c5aad8e00eefb676b7"
MODEL_ACCESS_TOKEN = "o2pnVAqFOrQBEZMCuzt797d676E6k4eS3mZMKJVKbeid8V6Bbig6kOdh6y9YSf3R"

# DO NOT EDIT these values
MODEL_BASE_URL = f"{DOMINO_URL}/api/modelApis/async/v1/{MODEL_ID}"
SUCCEEDED_STATUS = "succeeded"
FAILED_STATUS = "failed"
QUEUED_STATUS = "queued"
TERMINAL_STATUSES = [SUCCEEDED_STATUS, FAILED_STATUS]
PENDING_STATUSES = [QUEUED_STATUS]
MAX_RETRY_DELAY_SEC = 60

### CREATE REQUEST ###

create_response = None
retry_delay_sec = 0
while (
        create_response is None
        or (500 <= create_response.status_code < 600)  # retry for transient 5xx errors
):
    # status polling with a time interval that backs off up to MAX_RETRY_DELAY_SEC
    if retry_delay_sec > 0:
        time.sleep(retry_delay_sec)
    retry_delay_sec = min(max(retry_delay_sec * 2, 1), MAX_RETRY_DELAY_SEC)

    create_response = requests.post(
        MODEL_BASE_URL,
        headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
        json={"parameters": REQUEST_PARAMETERS}
    )

if create_response.status_code != 200:
    raise Exception(f"create prediction request failed, response: {create_response}")

prediction_id = create_response.json()["asyncPredictionId"]
logging.info(f"prediction id: {prediction_id}")

### POLL STATUS AND RETRIEVE RESULT ###

status_response = None
retry_delay_sec = 0
while (
        status_response is None
        or (500 <= status_response.status_code < 600)  # retry for transient 5xx errors
        or (status_response.status_code == 200 and status_response.json()["status"] in PENDING_STATUSES)
):
    # status polling with a time interval that backs off up to MAX_RETRY_DELAY_SEC
    if retry_delay_sec > 0:
        time.sleep(retry_delay_sec)
    retry_delay_sec = min(max(retry_delay_sec * 2, 1), MAX_RETRY_DELAY_SEC)

    status_response = requests.get(
        f"{MODEL_BASE_URL}/{prediction_id}",
        headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
    )

if status_response.status_code != 200:
    raise Exception(f"prediction status request failed, response: {status_response}")

prediction_status = status_response.json()["status"]
if prediction_status == SUCCEEDED_STATUS:  # succeeded response includes the prediction result in "result"
    result = status_response.json()["result"]
    logging.info(f"prediction succeeded, result:\n{json.dumps(result, indent=2)}")
elif prediction_status == FAILED_STATUS:  # failed response includes the error messages in "errors"
    errors = status_response.json()["errors"]
    logging.error(f"prediction failed, errors:\n{json.dumps(errors, indent=2)}")
else:
    raise Exception(f"unexpected terminal prediction response status: {prediction_status}")

Use Spot instances for Domino Endpoints (PREVIEW)

Domino supports serving endpoints on cost-effective Spot instances. Select a hardware tier whose node pool uses Spot instances.

If AWS interrupts a Spot instance, endpoints deployed on that instance are affected. If no replicas of the endpoint remain on other instances, the endpoint stops responding. Otherwise, the remaining replicas absorb the extra load and, depending on the model's runtime characteristics, may degrade in performance or also stop working. If this happens, change the endpoint's hardware tier to a non-Spot node pool until Spot instances of the requested type become available again.