Overview
This template provides a production‑ready OpenLLM instance as a Monk runnable. You can:
- Run it directly to deploy and serve large language models
- Inherit it in your own runnable to seamlessly add LLM inference to your AI stack
What this template manages
- OpenLLM container (`ghcr.io/bentoml/openllm` image)
- Network service on port 3000
- GPU resource allocation
- Model loading and serving
- Health checks and readiness probes
Quick start (run directly)
- Load the templates (see the commands after this list)
- Run OpenLLM with the defaults
- Customize the model (recommended via inheritance): the template ships with default `variables` (Mistral-7B with the vLLM backend).
  - Preferred: inherit and override the `variables` as shown below.
  - Alternative: fork/clone and edit the `variables` in `bentoml/openllm.yaml`, then `monk load MANIFEST` and run.
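A sketch of the direct-run flow, assuming the runnable is loaded as `bentoml/openllm` (substitute the manifest path and runnable name from your setup):

```bash
# Load the template manifest into Monk (path is illustrative)
monk load bentoml/openllm.yaml

# Run the OpenLLM runnable with its default variables (Mistral-7B, vLLM backend)
monk run bentoml/openllm

# Watch the logs while the model downloads and loads
monk logs bentoml/openllm
```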
Once running, the API is served on localhost:3000 (or the runnable hostname inside Monk networks).
Configuration
Key variables you can customize in this template are declared under `variables` in `bentoml/openllm.yaml`; at minimum you can override the model and inference backend (the defaults are Mistral-7B with the vLLM backend).
Use by inheritance (recommended for AI platforms)
Inherit the OpenLLM runnable in your application and declare a connection. Example:
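A minimal sketch, assuming the runnable is published as `bentoml/openllm`; the variable names below (`model_id`, `backend`) are illustrative, so match them to the names actually declared in `bentoml/openllm.yaml`:

```yaml
namespace: my-ai-app

llm:
  defines: runnable
  inherits: bentoml/openllm
  variables:
    # Illustrative variable names; use the ones declared in bentoml/openllm.yaml
    model_id: mistralai/Mistral-7B-Instruct-v0.2
    backend: vllm
```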
Ports and connectivity
- Service: `web` on TCP port `3000`
- From other runnables in the same process group, use `connection-hostname("<connection-name>")` to resolve the LLM host, as shown below.
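For instance, a consuming runnable could declare the connection and resolve the host like this (a sketch; the connection name `openllm`, the target path `bentoml/openllm`, and the container image are illustrative):

```yaml
namespace: my-ai-app

api:
  defines: runnable
  connections:
    openllm:
      runnable: bentoml/openllm
      service: web
  variables:
    llm-host:
      type: string
      # Resolves to the OpenLLM runnable's hostname inside the Monk network
      value: <- connection-hostname("openllm")
  containers:
    api:
      # Illustrative image; the application reads llm-host (port 3000) to reach the LLM
      image: ghcr.io/example/my-api:latest
```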
Features and capabilities
- Serve multiple LLM architectures (Llama, Mistral, Falcon, MPT, StarCoder, etc.)
- Production-ready inference with vLLM backend for high throughput
- OpenAI-compatible API endpoints (see the request example after this list)
- GPU acceleration support
- Automatic batching and optimization
- Health checks and monitoring
- Streaming responses
- Model quantization support
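For example, assuming the standard OpenAI-compatible paths, a chat completion request against the running service might look like this (the model name is illustrative and must match the model the server loaded):

```bash
curl -s http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```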
Supported Models
OpenLLM supports models from HuggingFace:
- Llama 2 and Llama 3 (7B, 13B, 70B)
- Mistral and Mixtral
- Falcon (7B, 40B)
- MPT and MPT-Instruct
- StarCoder and CodeLlama
- And many more from the HuggingFace model hub
Related templates
- Combine with vector databases (`qdrant/`, `chroma/`) for RAG applications
- Use with `langfuse/` for LLM observability and tracing
- Integrate with application frameworks for AI-powered apps
Troubleshooting
- Check server health: query the HTTP service on port 3000 (see the sketch after this list)
- Check logs: inspect the runnable's logs with the Monk CLI (see below)
- For GPU support, ensure NVIDIA drivers, CUDA toolkit, and Docker GPU runtime are installed on the host
- Model loading can take several minutes on first start depending on model size
- Monitor GPU memory usage; LLMs require significant VRAM
- Adjust the `--max-model-len` parameter to fit your GPU memory constraints
- If the container fails to start, check that GPU resources are available
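A sketch of the first two checks; the `/healthz` and `/readyz` paths follow BentoML's conventions and the runnable name `bentoml/openllm` is assumed:

```bash
# Liveness / readiness of the OpenLLM server (BentoML-style endpoints)
curl -i http://localhost:3000/healthz
curl -i http://localhost:3000/readyz

# Inspect the runnable's logs for model download and loading progress
monk logs bentoml/openllm
```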