| Roadmap | Support Matrix | Docs | Recipes | Examples | Prebuilt Containers | Design Proposals | Blogs
High-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Large language models exceed single-GPU capacity. Tensor parallelism splits model layers across GPUs but creates coordination challenges. Dynamo closes this orchestration gap.
Dynamo is inference engine agnostic (supports TRT-LLM, vLLM, SGLang) and provides:
- Disaggregated Prefill & Decode – Maximizes GPU throughput while letting you tune latency/throughput trade-offs
- Dynamic GPU Scheduling – Optimizes performance based on fluctuating demand
- LLM-Aware Request Routing – Eliminates unnecessary KV cache re-computation
- Accelerated Data Transfer – Reduces inference response time using NIXL
- KV Cache Offloading – Leverages multiple memory hierarchies for higher throughput
Built in Rust for performance and Python for extensibility, Dynamo is fully open-source with an OSS-first development approach.
|  | SGLang | TensorRT-LLM | vLLM |
|---|---|---|---|
| Best For | High-throughput serving | Maximum performance | Broadest feature coverage |
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KVBM | 🚧 | ✅ | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Full Feature Matrix → — Detailed compatibility including LoRA, Request Migration, Speculative Decoding, and feature interactions.
- [12/05] Moonshot AI's Kimi K2 achieves 10x inference speedup with Dynamo on GB200
- [12/02] Mistral AI runs Mistral Large 3 with 10x faster inference using Dynamo
- [12/01] InfoQ: NVIDIA Dynamo simplifies Kubernetes deployment for LLM inference
| Path | Use Case | Time | Requirements |
|---|---|---|---|
| Local Quick Start | Test on a single machine | ~5 min | 1 GPU, Ubuntu 24.04 |
| Kubernetes Deployment | Production multi-node clusters | ~30 min | K8s cluster with GPUs |
| Building from Source | Contributors and development | ~15 min | Ubuntu, Rust, Python |
Want to help shape the future of distributed LLM inference? See the Contributing Guide.
The following examples require a few system-level packages. We recommend Ubuntu 24.04 with an x86_64 CPU. See docs/reference/support-matrix.md for details.
Containers have all dependencies pre-installed. No setup required.
```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.8.1

# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.8.1

# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.8.1
```

Tip: To run the frontend and a worker in the same container, either run the processes in the background with `&` (see below), or open a second terminal and use `docker exec -it <container_id> bash`.
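For example, inside any of these containers you can keep everything in one shell by backgrounding the frontend and then starting a worker. This is a minimal sketch using the Local Quick Start commands shown later in this README; the model and log file name are illustrative.

```bash
# Frontend in the background, logging to a file (name is illustrative)
python3 -m dynamo.frontend --http-port 8000 --store-kv file > dynamo.frontend.log 2>&1 &

# Worker in the foreground (SGLang shown; pick the command for your backend)
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file
```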
See Release Artifacts for available versions.
The Dynamo team recommends the uv Python package manager, although any package manager will work.
```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
```

Install system dependencies and the Dynamo wheel for your chosen backend:
SGLang

```bash
sudo apt install python3-dev
uv pip install "ai-dynamo[sglang]"
```

Note: For CUDA 13 (B300/GB300), the container is recommended. See the SGLang install docs for details.
TensorRT-LLM

```bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```

Note: TensorRT-LLM requires `pip` due to a transitive Git URL dependency that `uv` doesn't resolve. We recommend using the TensorRT-LLM container for broader compatibility.
vLLM

```bash
sudo apt install python3-dev libxcb1
uv pip install "ai-dynamo[vllm]"
```

Tip (Optional): Before running Dynamo, verify your system configuration with:

```bash
python3 deploy/sanity_check.py
```
Dynamo provides a simple way to spin up a local set of inference components, including:
- OpenAI-Compatible Frontend – High-performance, OpenAI-compatible HTTP API server written in Rust.
- Basic and KV-Aware Router – Routes and load-balances traffic across a set of workers.
- Workers – A set of pre-configured LLM serving engines.
Start the frontend:

Tip: To run everything in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run a process in the background. Example: `python3 -m dynamo.frontend --store-kv file > dynamo.frontend.log 2>&1 &`

```bash
# Start an OpenAI-compatible HTTP server with prompt templating, tokenization, and routing.
# For local dev: --store-kv file avoids etcd (workers and frontend must share a disk)
python3 -m dynamo.frontend --http-port 8000 --store-kv file
```

In another terminal (or the same terminal if using background mode), start a worker for your chosen backend:
```bash
# SGLang
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --store-kv file

# TensorRT-LLM
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --store-kv file

# vLLM (note: uses --model, not --model-path)
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --store-kv file \
    --kv-events-config '{"enable_kv_cache_events": false}'
```

Note: For dependency-free local development, disable KV event publishing (avoids NATS):
- vLLM: Add `--kv-events-config '{"enable_kv_cache_events": false}'`
- SGLang: No flag needed (KV events disabled by default)
- TensorRT-LLM: No flag needed (KV events disabled by default)
TensorRT-LLM only: The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected and can be safely ignored.

See Service Discovery and Messaging for details.
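Once a worker has registered, you can optionally confirm that the model is visible before sending a request. The check below assumes the frontend exposes the standard OpenAI-style `/v1/models` listing on the port used above.

```bash
# List the models currently served by the local frontend (standard OpenAI-style endpoint assumed)
curl -s localhost:8000/v1/models | jq
```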
```bash
# The "model" field must match the model served by your worker (Qwen/Qwen3-0.6B above).
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "stream": false,
  "max_tokens": 300
}' | jq
```

Rerun with `curl -N` and set `"stream": true` in the request to receive responses as soon as the engine emits them.
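For reference, a streamed version of the same request looks like the following; it just applies the `curl -N` and `"stream": true` changes described above (the prompt is illustrative).

```bash
# Stream tokens as they are generated (-N disables curl's output buffering)
curl -N localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-0.6B",
  "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
  "stream": true,
  "max_tokens": 100
}'
```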
For production deployments on Kubernetes clusters with multiple GPUs.
- Kubernetes cluster with GPU nodes
- Dynamo Platform installed
- HuggingFace token for model downloads
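One common way to satisfy the last prerequisite is to store the HuggingFace token as a Kubernetes secret. The secret name and key below are illustrative; align them with whatever your deployment manifests reference.

```bash
# Illustrative secret name/key; match these to your deployment manifests
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<your-huggingface-token>
```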
Pre-built deployment configurations for common models and topologies:
| Model | Framework | Mode | GPUs | Recipe |
|---|---|---|---|---|
| Llama-3-70B | vLLM | Aggregated | 4x H100 | View |
| DeepSeek-R1 | SGLang | Disaggregated | 8x H200 | View |
| Qwen3-32B-FP8 | TensorRT-LLM | Aggregated | 8x GPU | View |
See recipes/README.md for the full list and deployment instructions.
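Deploying a recipe generally amounts to applying its manifest to your cluster. The path and namespace below are placeholders; use the actual files and instructions referenced in recipes/README.md.

```bash
# Placeholder path and namespace; see recipes/README.md for the real manifests
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n <your-namespace>
```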
For contributors who want to build Dynamo from source rather than installing from PyPI.
Ubuntu:

```bash
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
macOS:

```bash
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install cmake protobuf

# Check that Metal is accessible
xcrun -sdk macosx metal
```

If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
Follow the uv installation guide if you don't already have uv installed. Then create and activate a virtual environment:

- Install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Create a virtual environment:

```bash
uv venv dynamo
source dynamo/bin/activate
uv pip install pip maturin
```
Maturin is the Rust<->Python bindings build tool.
```bash
cd lib/bindings/python
maturin develop --uv
```
The GPU Memory Service is a Python package with a C++ extension. It requires only Python development headers and a C++ compiler (g++).
```bash
cd $PROJECT_ROOT
uv pip install -e lib/gpu_memory_service
```

```bash
cd $PROJECT_ROOT
uv pip install -e .
```

```bash
python3 -m dynamo.frontend
```

- Pass `--store-kv file` to avoid external dependencies (see Service Discovery and Messaging)
- Set `DYN_LOG` to adjust the logging level (e.g., `export DYN_LOG=debug`). Uses the same syntax as `RUST_LOG`
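For a dependency-free development loop with verbose logs, the two options above can be combined; this sketch simply restates the flags described here, reusing the HTTP port from the Quick Start.

```bash
# Debug-level logging (DYN_LOG uses RUST_LOG syntax) with the file-backed KV store
export DYN_LOG=debug
python3 -m dynamo.frontend --http-port 8000 --store-kv file
```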
Note: VSCode and Cursor users can use the `.devcontainer` folder for a pre-configured dev environment. See the devcontainer README for details.
Dynamo provides comprehensive benchmarking tools:
- Benchmarking Guide – Compare deployment topologies using AIPerf
- SLA-Driven Deployments – Optimize deployments to meet SLA requirements
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at `/openapi.json`. To generate it without running the server:

```bash
cargo run -p dynamo-llm --bin generate-frontend-openapi
```

This writes to docs/frontends/openapi.json.
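With a frontend already running locally (as in the Quick Start), you can also fetch the spec directly; the command below assumes the default HTTP port of 8000 used earlier.

```bash
# Inspect the endpoints exposed by a running frontend (port 8000 assumed)
curl -s localhost:8000/openapi.json | jq '.paths | keys'
```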
Dynamo uses TCP for inter-component communication. On Kubernetes, native resources (CRDs + EndpointSlices) handle service discovery. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
|---|---|---|---|
| Local Development | ❌ Not required | ❌ Not required | Pass --store-kv file; vLLM also needs --kv-events-config '{"enable_kv_cache_events": false}' |
| Kubernetes | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
Note: KV-Aware Routing requires NATS for prefix caching coordination.
For Slurm or other distributed deployments (and for KV-aware routing), you need etcd and NATS.

To quickly set up both: `docker compose -f deploy/docker-compose.yml up -d`
See SGLang on Slurm and TRT-LLM on Slurm for deployment examples.


