
Overview

This template provides a production‑ready OpenLLM instance as a Monk runnable. You can:
  • Run it directly to deploy and serve large language models
  • Inherit it in your own runnable to seamlessly add LLM inference to your AI stack
It exposes OpenLLM on port 3000, supports GPU acceleration, and can serve various models from HuggingFace with configurable backends (vLLM, PyTorch, etc.).

What this template manages

  • OpenLLM container (ghcr.io/bentoml/openllm image)
  • Network service on port 3000
  • GPU resource allocation
  • Model loading and serving
  • Health checks and readiness probes

Quick start (run directly)

  1. Load templates
monk load MANIFEST
  2. Run OpenLLM with defaults
monk run bentoml/openllm
  3. Customize the model (recommended via inheritance)
Running directly uses the defaults defined in this template’s variables (Mistral-7B with vLLM backend).
  • Preferred: inherit and override variables as shown below.
  • Alternative: fork/clone and edit the variables in bentoml/openllm.yaml, then monk load MANIFEST and run.
Once started, query the model at localhost:3000 (or the runnable hostname inside Monk networks).
curl -X POST http://localhost:3000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing",
    "max_tokens": 100
  }'
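OpenLLM also exposes OpenAI-compatible endpoints (see Features and capabilities below). A sketch of a chat completion request, assuming the default Mistral model is loaded; the exact route and accepted fields depend on your OpenLLM version:
curl -X POST http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 100
  }'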

Configuration

Key variables you can customize in this template:
variables:
  model: "mistralai/Mistral-7B-v0.1"   # HuggingFace model ID
  backend: "vllm"                       # backend (vllm, pt, etc.)
  params: "--max-model-len 8464"        # additional OpenLLM parameters
The container requests 1 GPU by default. Models are downloaded and cached automatically on first start.
To add LLM inference to your own stack, inherit the OpenLLM runnable in your application and declare a connection to its web service. Example:
namespace: myapp
llm:
  defines: runnable
  inherits: bentoml/openllm
  variables:
    model: "meta-llama/Llama-2-7b-chat-hf"
    backend: "vllm"
    params: "--max-model-len 4096"
api:
  defines: runnable
  containers:
    api:
      image: myorg/ai-api
  connections:
    llm-service:
      runnable: llm
      service: web
  variables:
    llm-endpoint:
      value: <- connection-hostname("llm-service")
Then run your API runnable:
monk run myapp/api
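To start the LLM and the API together, you can also add a process group alongside the runnables above; a minimal sketch (the group name stack is illustrative):
stack:
  defines: process-group
  runnable-list:
    - myapp/llm
    - myapp/api
Then start everything with:
monk run myapp/stack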

Ports and connectivity

  • Service: web on TCP port 3000
  • From other runnables in the same process group, use connection-hostname("<connection-name>") to resolve the LLM host (see the sketch below).
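As a sketch, the api runnable from the example above could export the resolved host and port to its container environment (the LLM_HOST and LLM_PORT names are illustrative):
  variables:
    llm-host:
      env: LLM_HOST
      value: <- connection-hostname("llm-service")
    llm-port:
      env: LLM_PORT
      value: 3000
The API code can then reach OpenLLM at http://$LLM_HOST:$LLM_PORT.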

Features and capabilities

  • Serve multiple LLM architectures (Llama, Mistral, Falcon, MPT, StarCoder, etc.)
  • Production-ready inference with vLLM backend for high throughput
  • OpenAI-compatible API endpoints
  • GPU acceleration support
  • Automatic batching and optimization
  • Health checks and monitoring
  • Streaming responses
  • Model quantization support
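Quantization is typically requested through the params variable when inheriting; a sketch (the --quantize flag and its accepted values vary across OpenLLM versions and backends):
llm:
  defines: runnable
  inherits: bentoml/openllm
  variables:
    model: "mistralai/Mistral-7B-v0.1"
    params: "--max-model-len 4096 --quantize awq"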

Supported Models

OpenLLM supports models from HuggingFace:
  • Llama 2 and Llama 3 (7B, 13B, 70B)
  • Mistral and Mixtral
  • Falcon (7B, 40B)
  • MPT and MPT-Instruct
  • StarCoder and CodeLlama
  • And many more from the HuggingFace model hub

Combine with other templates

  • Combine with vector databases (qdrant/, chroma/) for RAG applications
  • Use with langfuse/ for LLM observability and tracing
  • Integrate with application frameworks for AI-powered apps

Troubleshooting

  • Check server health:
curl http://localhost:3000/healthz
  • Check logs:
monk logs -l 500 -f bentoml/openllm
  • For GPU support, ensure NVIDIA drivers, the CUDA toolkit, and the Docker GPU runtime are installed on the host (see the quick check after this list)
  • Model loading can take several minutes on first start depending on model size
  • Monitor GPU memory usage; LLMs require significant VRAM
  • Adjust the --max-model-len value (via the params variable) to fit your GPU memory constraints
  • If container fails to start, check that GPU resources are available
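A quick way to confirm the host can run GPU containers (the CUDA image tag is only an example):
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi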