Register and deploy LLMs

Domino lets you register large language models (LLMs) and deploy them as hosted endpoints with optimized inference. These endpoints provide OpenAI-compatible APIs that your applications and agentic systems can call.

You can register models from Hugging Face or from your experiment runs, then deploy them as endpoints. You’ll need Project Collaborator permissions to register models and create endpoints.

Planning your endpoint deployment

Before creating an endpoint, consider these key factors to ensure optimal performance and cost-efficiency:

Understand your model’s requirements

  • Check your model’s documentation for minimum memory and compute requirements.

  • Choose appropriate resource sizes based on requirements and expected usage patterns.

Size resources appropriately for expected usage

  • Account for concurrent users: if you expect high throughput or multiple simultaneous requests, minimal GPU sizes may cause slowdowns. Scale up the hardware tier or consider deploying multiple endpoints.

  • Balance performance against cost. Start with a tier that meets your requirements and monitor performance before scaling up.
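One way to reason about sizing for concurrent users is Little's law: the average number of in-flight requests equals the arrival rate times the average latency. The sketch below turns that into a rough replica estimate; the function name and parameters are illustrative, not a Domino API:

```python
from math import ceil

def replicas_needed(requests_per_min: float, avg_latency_s: float,
                    concurrency_per_replica: int) -> int:
    """Rough replica estimate via Little's law:
    average in-flight requests = arrival rate x average latency."""
    arrival_rate = requests_per_min / 60.0      # requests per second
    concurrent = arrival_rate * avg_latency_s   # average in-flight requests
    return max(1, ceil(concurrent / concurrency_per_replica))

# Example: 600 requests/min at 3 s average latency, with each replica
# handling 8 concurrent requests, suggests 4 replicas.
```

Treat the result as a starting point for choosing a hardware tier; real throughput depends on model size, batching, and token lengths, so monitor before scaling.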

Step 1: Register a model

Register a model to make it available for deployment as an endpoint. Go to Models > Register to get started.

Register a model
  1. Choose your model source:

    1. Hugging Face models that you have access to, or

    2. Experiment runs that include a logged MLflow model

  2. Complete the required fields.

Step 2: Create an endpoint

After registering a model, you can deploy it as an endpoint. From your registered model’s Endpoints tab, click Create endpoint.

Create an endpoint
  1. Complete the endpoint configuration details and choose a model source environment and resource size.

  2. Configure access by adding the users or organizations that can call this endpoint.

  3. Click Create endpoint.

The endpoint deploys with the vLLM runtime, which provides optimized inference performance and OpenAI-compatible APIs.
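Because the endpoint speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can call it. The standard-library sketch below builds such a request; the endpoint URL, model name, and API key are placeholders to replace with your own values:

```python
import json
from urllib import request

# Placeholders -- substitute your endpoint's URL and your API key.
ENDPOINT_URL = "https://your-domino-host/endpoints/my-llm/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

def build_chat_request(model: str, prompt: str,
                       max_tokens: int = 256) -> request.Request:
    """Build a POST request in the OpenAI chat-completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

# Against a live endpoint, send the request and read the reply:
#   with request.urlopen(build_chat_request("my-model", "Hello")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

The same endpoint can also be used with the official `openai` client library by pointing its base URL at the endpoint.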

Step 3: Monitor endpoint performance

After deploying your endpoint, you can monitor its performance and usage.

Monitor an endpoint
  • Overview: Configuration details and deployment status

  • Performance: Token usage and latency metrics over time

  • Usage: Endpoint invocation frequency
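You can complement the Performance tab with simple client-side measurements. This sketch times a single call and reads the token counts from an OpenAI-compatible response body; the helper names are illustrative, not part of Domino:

```python
import time

def timed_call(send_fn):
    """Call send_fn (which sends one request and returns the parsed
    response body) and measure wall-clock latency in seconds."""
    start = time.perf_counter()
    response = send_fn()
    latency_s = time.perf_counter() - start
    return response, latency_s

def token_usage(response: dict) -> dict:
    """Pull token counts from an OpenAI-compatible response body,
    which reports them under the top-level "usage" key."""
    usage = response.get("usage", {})
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }
```

Logging these numbers per request makes it easier to correlate client-observed latency with the endpoint's Performance metrics.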

See Monitor model endpoint performance for more detail on using these monitoring capabilities during model development and after deployment, so your models perform efficiently and reliably in production.

Next steps