Domino Model APIs make it easy to deploy your model to a REST endpoint, so you can quickly start making calls to get predictions from your models. Learn how to:

- Deploy a model for synchronous (real-time) inference.
- Deploy a model for asynchronous inference.
- Access Domino assets in a model image.
- Request predictions from a Model API endpoint.
- Retrain a model.
- See an example of an asynchronous inference application.
When you deploy a Model API, Domino packages the project’s files as a Flask application. By default, the model image includes the compute environment, all project files, a Flask/Plumber harness that exposes the REST interface, and an authentication and load balancing layer.
A Domino Model API is a REST API endpoint wrapped around your inference code. You supply the arguments to your inference code as parameters in the REST request payload, and the response from the API includes the prediction. When a Model API is published, Domino runs the script that contains the function. As the process waits for input, any objects or functions from the script remain in memory. Each call to the endpoint runs the function. Since the script is only sourced at publish time, expensive initialization occurs only once rather than on each call.
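As a minimal sketch (file, variable, and function names here are illustrative, not Domino requirements), an inference script might separate one-time setup from per-call work like this:

```python
import math

# model.py -- a minimal inference script sketch.
# Module-level initialization runs once, when Domino sources this script at
# publish time; the resulting objects stay in memory between requests.
LOOKUP = {i: math.sqrt(i) for i in range(1000)}  # stand-in for loading a trained model


def predict(x):
    """Domino runs this function on every call to the Model API endpoint."""
    value = LOOKUP[x] if x in LOOKUP else math.sqrt(x)
    return {"sqrt": value}
```

Because `LOOKUP` is built outside the function, the expensive work happens once at publish time, while each REST call only pays for the cheap `predict` body.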
Decide whether your model should be available as a synchronous or an asynchronous endpoint.

- Synchronous inference - Deploy the model so it can receive a REST request, make a prediction, and return results synchronously with low latency. This is useful for interactive applications where predictions must be returned immediately.
- Asynchronous inference - Deploy the model so it can process and make predictions asynchronously for computationally intensive workloads. The asynchronous REST interface polls for results after a request is queued. After completion, the endpoint returns the result as a response to the polling request.
Domino limits the size of the request and response packets to 10 KB, so users are expected to send and receive the payloads by reference. After Domino picks up the processing request, it keeps the request alive for 30 minutes. Predictions are written to an external data store that you define.
Domino supports a few different methods for model deployment:

- Deploy from the UI - For quick, one-time deployments.
- Deploy with a scheduled job - Schedule a job and deploy a model from the job. Especially useful for keeping models trained on the latest data on a regular cadence.
- Deploy with the API - Use the Domino API to deploy models from other applications.
Deploy from the UI
Use the Domino UI to deploy your model directly from your browser.
1. In your project, go to Model APIs > Create Model API.
2. Provide a name and description, and select the prediction code that executes when the model is called.
3. (Optional) Select a custom compute environment to build the deployed model container.
4. (Optional) Configure scaling: the number of instances and the compute resources attached to each instance. See Scale model deployments for more information.
5. Under Request Type, select Sync for real-time predictions or Async for long-running predictions.
Deploy with a scheduled job
You can use scheduled jobs to deploy models. This is especially useful for keeping models trained on the latest data. When scheduling a job, select the Model API from the Update Model API menu. Domino uses the state of the project’s files after the run to build and deploy a new Model API.
To keep your Model API up to date automatically, include a training script in the job that trains on fresh data.
Deploy with the Domino API
Use Domino’s APIs to programmatically build and deploy models. For more information, see the API docs. Set the `isAsync` parameter to true in the API call to create an asynchronous inference endpoint.
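The publish call might look like the following sketch. The endpoint path (`/v1/models`), header name, and body field names are assumptions, not a definitive contract; confirm them against the API docs:

```python
import requests


def build_publish_payload(project_id, file, function, name, is_async):
    """Request body for publishing a Model API (field names are assumptions).

    Setting isAsync to True creates an asynchronous inference endpoint.
    """
    return {
        "projectId": project_id,
        "file": file,          # script containing the prediction function
        "function": function,  # function to expose behind the REST endpoint
        "name": name,
        "isAsync": is_async,
    }


def publish_model(domino_url, api_key, payload):
    """POST the payload to the assumed model-publishing endpoint."""
    return requests.post(
        f"{domino_url}/v1/models",
        headers={"X-Domino-Api-Key": api_key},
        json=payload,
    )
```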
You can still access Domino artifacts from model images. However, there are nuances to each artifact type. Learn how to use Domino artifacts from your model image and keep them up to date.
Model API environments
When you deploy a Model API, select the compute environment to include in the model image. The environment bundles the packages your inference script requires ahead of time.
- Model API hosts don’t read `requirements.txt` files or execute commands defined in the pre-setup, post-setup, pre-run, or post-run scripts of your environment. If your project uses `requirements.txt` or any setup scripts to install specific packages or repositories, add them to the Dockerfile instructions of your environment.
- Your Model API doesn’t inherit environment variables set at the project level. However, you can set Model API-specific environment variables on the Model settings page. This separation decouples the management of projects and deployed models. See Secure Credential Storage for more details.
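For example, a project that previously relied on `requirements.txt` could move its installs into the environment’s Dockerfile instructions like this (package names and versions are illustrative):

```dockerfile
# Install packages at image-build time instead of relying on requirements.txt,
# which Model API hosts do not read.
RUN pip install --no-cache-dir numpy==1.26.4 scikit-learn==1.4.2
```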
Model API project files
Your Model API can access files from the project. Domino loads the project files onto the Model API host, similar to a Run or Workspace executor host, with a few important differences:
- Domino adds project files to the image when you build a Model API. Starting or stopping the Model API won’t change the files on the model host. To update the model host files, create a new version.
- Domino pulls Git repositories attached to the project when you build a Model API. Starting or stopping the Model API won’t change the files on the model host. To update the model host files, create a new version.
- The Model API host mounts project files at `/mnt/<username>/<project_name>`. This location differs from the default location for Runs or Workspaces, `/mnt/`. You can use the Domino environment variable `DOMINO_WORKING_DIR` to reference the directory where your project is mounted.
- The Model API image excludes project files listed in the `.modelignore` file located in the project’s root directory. Excluded files are not mounted to the Model API host.
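Because the mount point differs between Model API hosts and Run/Workspace hosts, it can help to build paths from `DOMINO_WORKING_DIR` rather than hard-coding them. A small sketch (the `models/model.pkl` file name is illustrative):

```python
import os

# DOMINO_WORKING_DIR points at the project mount: /mnt/<username>/<project_name>
# on a Model API host, versus /mnt for Runs and Workspaces. Falling back to /mnt
# keeps the path logic working in both contexts.
project_dir = os.environ.get("DOMINO_WORKING_DIR", "/mnt")
model_path = os.path.join(project_dir, "models", "model.pkl")
print(model_path)
```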
Add a Kubernetes volume to a synchronous Model API container
To load inference data from, or write responses to, an external volume, add Kubernetes volumes:

1. Select a Model API from the Model APIs page.
2. Go to Settings > Advanced > Add Volume.
3. Enter the required values:
   - Name - the Kubernetes volume name.
   - Mount Path - the mount point in the Model API container.
   - Path - the path on the Kubernetes host node to mount in the Model API container, as configured by your administrator.
   - Read Only? - the read/write permission of the mounted volume.

See the Kubernetes documentation for more details.
Request predictions from a Model API endpoint
After you deploy the Model API and its status changes to Running, use the Tester tab in the Model API UI to try test inputs.
JSON requests
Send your requests as JSON objects. Depending on how you wrote your prediction script, format the JSON request as follows:

- If you use named parameters in your function definition, use a data dictionary or parameter array in your JSON request. For example, for `my_function(x, y, z)`, use `{"data": {"x": 1, "y": 2, "z": 3}}` or `{"parameters": [1, 2, 3]}`.
- If you use a dictionary in your function definition, use only a parameter array. For example, if you define `my_function(dict)` and your function uses `dict["x"]` and `dict["y"]`, send the request `{"parameters": [{"x": 1, "y": 2, "z": 3}]}`.
- In Python, you can also use kwargs to pass a variable number of arguments. If you define `my_function(x, **kwargs)` and your function uses `kwargs["y"]` and `kwargs["z"]`, you can use a data dictionary to call your Model API: `{"data": {"x": 1, "y": 2, "z": 3}}`.
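To make the mapping concrete, here is how the three conventions look on the function side (function names are illustrative); Domino unpacks the JSON payload into the corresponding Python arguments:

```python
# Illustrative prediction functions for the three JSON calling conventions.

def with_named_params(x, y, z):
    # Matches {"data": {"x": 1, "y": 2, "z": 3}} or {"parameters": [1, 2, 3]}
    return x + y + z


def with_dict(d):
    # Matches {"parameters": [{"x": 1, "y": 2, "z": 3}]}
    return d["x"] + d["y"] + d["z"]


def with_kwargs(x, **kwargs):
    # Matches {"data": {"x": 1, "y": 2, "z": 3}}
    return x + kwargs["y"] + kwargs["z"]
```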
Domino converts JSON data types to the following R and Python data types.
| JSON Type | Python Type | R Type |
|---|---|---|
| dictionary | dictionary | named list |
| array | list | list |
| string | str | character |
| number (int) | int | integer |
| number (real) | float | numeric |
| true | True | TRUE |
| false | False | FALSE |
| null | None | N/A |
The Model API returns the result object, which is a literal, an array, or a dictionary.
Synchronous requests
Request a prediction and retrieve the result: pass the URL of the Domino Model API, the authorization token, and the input parameters. The response object contains the status, response headers, and result.

```python
import requests

response = requests.post(
    "{DOMINO_URL}/models/{MODEL_ID}/latest/model",
    auth=("{MODEL_ACCESS_TOKEN}", "{MODEL_ACCESS_TOKEN}"),
    json={"data": {"start": 1, "stop": 100}},
)
print(response.status_code)
print(response.headers)
print(response.json())
```
Asynchronous requests
- Request a prediction: pass the URL of the Domino Model API, the authorization token, and the input parameters. For large payloads (>10 KB), pass a reference to the payload location. The response contains a prediction identifier that you use to poll for completion and retrieve the result.

  ```python
  create_response = requests.post(
      "{DOMINO_URL}/api/modelApis/async/v1/{MODEL_ID}",
      headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
      json={"parameters": {"input_file": "s3://example/filename.ext"}},
  )
  prediction_id = create_response.json()["asyncPredictionId"]
  ```

- Poll for completion and retrieve the result: use the `prediction_id` from the request to retrieve the result. The response includes one of the following statuses: `SUCCEEDED`, `FAILED`, or `QUEUED`. For a successful completion, the `result` field contains the output from the predict function.

  ```python
  status_response = requests.get(
      f"{MODEL_BASE_URL}/{prediction_id}",
      headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
  )
  if status_response.json()["status"] == SUCCEEDED_STATUS:
      # a succeeded response includes the prediction result in "result"
      result = status_response.json()["result"]
  ```
Deploy a new version
After you retrain your model with new data or switch to a different machine learning algorithm, publish a new version of the Model API. As a best practice, stop the previous version of the Model API before you deploy the new one.
This example shows a Python client application that sends a prediction request to an asynchronous Model API, polls periodically for completion, and retrieves the result.

```python
import json
import logging
import sys
import time

import requests

logging.basicConfig(stream=sys.stdout, level=logging.INFO)  # change logging setup as required

# TO EDIT: update the example request parameters for your model
REQUEST_PARAMETERS = {
    "param1": "value1",
    "param2": "value2",
    "param3": 3,
}

# TO EDIT: copy these values from "Calling your Model" on the Model API overview page
DOMINO_URL = "https://domino.mycompany.com:443"
MODEL_ID = "5a4131c5aad8e00eefb676b7"
MODEL_ACCESS_TOKEN = "o2pnVAqFOrQBEZMCuzt797d676E6k4eS3mZMKJVKbeid8V6Bbig6kOdh6y9YSf3R"

# DO NOT EDIT these values
MODEL_BASE_URL = f"{DOMINO_URL}/api/modelApis/async/v1/{MODEL_ID}"
SUCCEEDED_STATUS = "succeeded"
FAILED_STATUS = "failed"
QUEUED_STATUS = "queued"
TERMINAL_STATUSES = [SUCCEEDED_STATUS, FAILED_STATUS]
PENDING_STATUSES = [QUEUED_STATUS]
MAX_RETRY_DELAY_SEC = 60

### CREATE REQUEST ###
create_response = None
retry_delay_sec = 0
while (
    create_response is None
    or (500 <= create_response.status_code < 600)  # retry for transient 5xx errors
):
    # retry with a time interval that backs off up to MAX_RETRY_DELAY_SEC
    if retry_delay_sec > 0:
        time.sleep(retry_delay_sec)
    retry_delay_sec = min(max(retry_delay_sec * 2, 1), MAX_RETRY_DELAY_SEC)
    create_response = requests.post(
        MODEL_BASE_URL,
        headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
        json={"parameters": REQUEST_PARAMETERS},
    )

if create_response.status_code != 200:
    raise Exception(f"create prediction request failed, response: {create_response}")

prediction_id = create_response.json()["asyncPredictionId"]
logging.info(f"prediction id: {prediction_id}")

### POLL STATUS AND RETRIEVE RESULT ###
status_response = None
retry_delay_sec = 0
while (
    status_response is None
    or (500 <= status_response.status_code < 600)  # retry for transient 5xx errors
    or (status_response.status_code == 200 and status_response.json()["status"] in PENDING_STATUSES)
):
    # status polling with a time interval that backs off up to MAX_RETRY_DELAY_SEC
    if retry_delay_sec > 0:
        time.sleep(retry_delay_sec)
    retry_delay_sec = min(max(retry_delay_sec * 2, 1), MAX_RETRY_DELAY_SEC)
    status_response = requests.get(
        f"{MODEL_BASE_URL}/{prediction_id}",
        headers={"Authorization": f"Bearer {MODEL_ACCESS_TOKEN}"},
    )

if status_response.status_code != 200:
    raise Exception(f"prediction status request failed, response: {status_response}")

prediction_status = status_response.json()["status"]
if prediction_status == SUCCEEDED_STATUS:  # a succeeded response includes the prediction result in "result"
    result = status_response.json()["result"]
    logging.info(f"prediction succeeded, result:\n{json.dumps(result, indent=2)}")
elif prediction_status == FAILED_STATUS:  # a failed response includes the error messages in "errors"
    errors = status_response.json()["errors"]
    logging.error(f"prediction failed, errors:\n{json.dumps(errors, indent=2)}")
else:
    raise Exception(f"unexpected terminal prediction response status: {prediction_status}")
```