Install
# With uv (see /tutorials/uv-fast-python-toolchain.html); vLLM needs Python 3.10+
uv init vllm-server && cd vllm-server
uv add vllm
# Or pip in a regular venv
python3 -m venv .venv && source .venv/bin/activate
pip install vllm
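vLLM targets CUDA (NVIDIA) and ROCm (AMD) GPUs; a CPU backend exists as a fallback, but it is very slow and only suitable for testing. The default pip wheel is built for CUDA, so check the vLLM installation docs for ROCm or CPU builds. For production, use NVIDIA L40S / A100 / H100 or AMD MI300X; consumer cards such as the RTX 4090 / 5090 work for smaller models.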
Serve a model with the OpenAI-compatible API
vllm serve mistralai/Mistral-Small-Instruct-2509 \
--port 8000 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--tensor-parallel-size 1
That's it. vLLM downloads the weights from the Hugging Face Hub, loads them onto the GPU, pre-allocates a KV cache from whatever remains of the --gpu-memory-utilization budget (here 90% of GPU memory), and exposes an OpenAI-compatible API at http://localhost:8000/v1.
Any OpenAI client works:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-Instruct-2509",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
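If you would rather not depend on the openai package, the same call is a plain HTTP POST. The sketch below uses only the standard library; the helper names build_chat_request and send are made up for this example, and it assumes the server from the previous step is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="dummy"):
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling send(build_chat_request("http://localhost:8000/v1", "mistralai/Mistral-Small-Instruct-2509", "Hello")) should return the same text as the client example, provided the server is running.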
The throughput story
vLLM's headline feature is continuous batching: the batch is rescheduled at every decoding step (token level) rather than once per group of requests. When request A finishes generating, its slot is handed to a waiting request on the next step, without waiting for the rest of the batch to finish. The second ingredient is PagedAttention: the KV cache is stored in fixed-size blocks (analogous to pages in virtual memory) instead of one contiguous region per request, which cuts fragmentation and lets far more concurrent sequences fit in the same GPU memory.
For a server handling many concurrent users, this translates to:
- Often 5-20x higher aggregate throughput (tokens/sec across all requests) than serving with HuggingFace Transformers or Ollama.
- More predictable P99 latency under load.
- Better GPU utilization: an H100 kept 80-90% busy instead of the 30-40% typical of single-stream Ollama.
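To make the scheduling difference concrete, here is a toy simulation, not vLLM's actual scheduler. It assumes every running request emits one token per decoding step and counts how many steps it takes to drain a queue, first with request-level (static) batches and then with slots refilled as soon as a request finishes.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Request-level batching: each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Token-level batching: a freed slot is refilled on the next step."""
    waiting = deque(lengths)   # tokens still to generate, per queued request
    running = []               # tokens still to generate, per active slot
    steps = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding step: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave the batch immediately.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# One long request (10 tokens) mixed with five short ones (2 tokens each).
lengths = [10, 2, 2, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=2))      # 14
print(continuous_batching_steps(lengths, batch_size=2))  # 10
```

With static batching the short requests wait behind the long one; with continuous batching they flow through the second slot while the long request keeps decoding.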
Multi-GPU: tensor and pipeline parallelism
For models that don't fit on one GPU:
# Shard each layer's weight matrices across 4 GPUs via tensor parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 16384
# Or add pipeline parallelism (different layer ranges on different GPU groups): 2 stages x 2-way tensor parallel = 4 GPUs
vllm serve very-large-model \
--pipeline-parallel-size 2 \
--tensor-parallel-size 2
Combined with multi-node deployment (typically tensor parallelism within each node and pipeline parallelism across nodes), vLLM can serve frontier-scale models (200B+ active parameters) on clusters of ordinary multi-GPU servers.
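A quick sanity check before launching: the two sizes multiply to give the number of GPUs the deployment occupies, and the tensor-parallel size has to divide the model's attention-head count evenly. The helpers below are a sketch with made-up names; the 64-head figure for Llama-3.1-70B is an assumption to verify against the model's config.json.

```python
def gpus_required(tensor_parallel_size, pipeline_parallel_size=1):
    """Total GPUs used: each pipeline stage is itself tensor-parallel."""
    return tensor_parallel_size * pipeline_parallel_size


def heads_divide_evenly(num_attention_heads, tensor_parallel_size):
    """Attention heads are split across tensor-parallel ranks, so they must divide evenly."""
    return num_attention_heads % tensor_parallel_size == 0


print(gpus_required(4))            # 4  (the Llama-3.1-70B example)
print(gpus_required(2, 2))         # 4  (2 pipeline stages x 2-way tensor parallel)
print(heads_divide_evenly(64, 4))  # True  (assuming 64 attention heads)
print(heads_divide_evenly(64, 3))  # False
```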
Quantization for smaller VRAM
# FP8 quantization (less VRAM, faster on GPUs with FP8 support)
vllm serve mistralai/Mistral-Small-Instruct-2509 \
--quantization fp8
# AWQ / GPTQ pre-quantized models
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
--quantization awq
# GGUF (in vLLM 0.7+) — the format Ollama uses
vllm serve some-org/Llama-3.1-8B-Instruct-GGUF
FP8 on Hopper (H100) gives nearly the same quality as FP16 at half the memory, with faster compute. AWQ (4-bit weights) is the sweet spot for consumer GPUs: it shrinks a 70B model's weights from 140 GB to roughly 40 GB.
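Those figures follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. The function below is a back-of-the-envelope sketch (the name is made up here); it ignores the KV cache, activations, and the quantization scales that push a 4-bit 70B model from the raw 35 GB toward the ~40 GB quoted above.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


print(weight_memory_gb(70, 16))  # 140.0  (FP16)
print(weight_memory_gb(70, 8))   # 70.0   (FP8)
print(weight_memory_gb(70, 4))   # 35.0   (4-bit AWQ, before scales and overhead)
```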
Structured output guidance
vLLM supports constrained decoding: you can force the model's output to be valid JSON for a given schema, to match a regex, or to follow a grammar:
resp = client.chat.completions.create(
    model="...",
    messages=[...],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "extract",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name"]
            }
        }
    },
)
The model's logits are masked at each generation step so that only tokens that keep the output valid for the schema can be sampled. As long as generation is not cut short (for example by hitting max_tokens), the output conforms to the schema; no more "model said it would JSON-format but added prose around it" failures.
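The masking step itself is simple. The toy below is an illustration of the idea, not vLLM's implementation: it takes a made-up logits table over a three-token vocabulary, sets every token the grammar disallows to negative infinity, and then picks greedily.

```python
import math


def mask_logits(logits, allowed):
    """Keep scores for allowed tokens; push everything else to -inf."""
    return {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }


def greedy_pick(logits):
    """Return the highest-scoring token."""
    return max(logits, key=logits.get)


# Hypothetical scores at the first output position.
logits = {"Sure": 5.0, "{": 1.0, "}": 0.5}

print(greedy_pick(logits))                      # Sure
# A JSON object must start with "{", so that is the only allowed token here.
print(greedy_pick(mask_logits(logits, {"{"})))  # {
```

In a real engine the allowed set is recomputed after every token from the schema or grammar state.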
Prefix caching
For workloads where many requests share long prefixes (a chat with a 4 KB system prompt, RAG queries with a fixed instruction header), prefix caching reuses the computed KV cache for the shared portion across requests:
vllm serve ... --enable-prefix-caching
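On such workloads throughput can improve 2-5x, because the shared prefix is prefilled once instead of once per request. Workloads with no shared prefixes see little change.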
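The mechanism can be sketched in a few lines. This toy cache is an assumption-laden illustration, not vLLM's data structure: it splits a token sequence into fixed-size blocks, keys each block by the entire prefix up to its end, and reports how many tokens could reuse previously computed KV entries.

```python
class PrefixCache:
    """Toy block-level prefix cache keyed by the full token prefix."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.seen = set()

    def process(self, tokens):
        """Return (reused_tokens, computed_tokens) for one request."""
        reused = 0
        # Only full blocks are cacheable; a trailing partial block is recomputed.
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])  # a block is valid only after the same prefix
            if key in self.seen:
                reused += self.block_size
            else:
                self.seen.add(key)
        return reused, len(tokens) - reused


cache = PrefixCache(block_size=4)
system_prompt = list(range(8))  # 8 shared "system prompt" tokens

print(cache.process(system_prompt + [100, 101, 102]))            # (0, 11)
print(cache.process(system_prompt + [200, 201, 202, 203, 204]))  # (8, 5)
```

The second request reuses both blocks of the shared prompt and only computes its own five tokens.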
Speculative decoding
Use a small "draft" model to propose K tokens, then have the big model verify them all in parallel. When the draft agrees, you get K tokens (plus one more from the big model) for a single forward pass of the big model:
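# Recent vLLM releases take a JSON --speculative-config; older releases used
# separate --speculative-model / --num-speculative-tokens flags. Check `vllm serve --help`.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
--speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5}'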
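For workloads where the draft model agrees with the big one most of the time (typical English text), per-request latency can drop by roughly 1.5-2x. The gain shrinks at high concurrency, where the GPU is already saturated by batching.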
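The propose-and-verify loop is easier to see in code. This is a greedy toy version under stated assumptions: both "models" are plain functions that return the next token for a context, and verification runs position by position, whereas a real engine scores all K positions in one forward pass and uses rejection sampling when sampling is enabled.

```python
def speculative_step(draft_next, target_next, context, k):
    """Return the tokens accepted in one propose-and-verify round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target model checks each proposed position.
    accepted, ctx = [], list(context)
    for token in proposal:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # keep the target's correction and stop
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched: the target's next token comes for free.
    accepted.append(target_next(ctx))
    return accepted


# Toy "models" that replay fixed sentences; the draft is wrong at position 3.
TARGET = "the quick brown fox jumps over".split()
DRAFT = "the quick brown dog jumps over".split()


def target_next(ctx):
    return TARGET[len(ctx)]


def draft_next(ctx):
    return DRAFT[len(ctx)]


print(speculative_step(draft_next, target_next, [], k=5))
# ['the', 'quick', 'brown', 'fox']
```

Even with one wrong guess, the round yields four tokens for one verification pass; with a perfect draft it would yield six.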
Production deployment
# Systemd unit
[Unit]
Description=vLLM server
Wants=network-online.target
After=network-online.target
[Service]
User=vllm
Group=vllm
Environment=HF_HOME=/var/lib/vllm/huggingface
Environment=VLLM_NO_USAGE_STATS=1
ExecStart=/opt/vllm/bin/vllm serve mistralai/Mistral-Small-Instruct-2509 \
--port 8000 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768 \
--enable-prefix-caching \
--quantization fp8
Restart=always
[Install]
WantedBy=multi-user.target
Or as a container:
docker run --gpus all \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-Small-Instruct-2509 \
--port 8000
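Model loading can take minutes, so deploy scripts usually wait for the server before sending traffic. The sketch below polls an HTTP endpoint until it answers 200; it assumes the server exposes a /health route on the same port, and the function names are made up for this example. The probe and sleep functions are injectable so the loop can be tested without a server.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, probe=http_probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until the probe succeeds; return the attempt number, or None on timeout."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay)
    return None
```

A deploy script would call wait_until_ready("http://localhost:8000/health") and abort the rollout if it returns None.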
Monitoring
vLLM exposes Prometheus metrics at /metrics:
- Request count + latency histograms
- Token throughput (prompt vs generated)
- GPU memory + KV-cache usage
- Queue depth + scheduler stats
Pair with Prometheus + Grafana (see that tutorial) for production observability. The KV-cache utilization graph is the most useful single chart for capacity planning.
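For a quick look without a full Prometheus stack, the /metrics text can be parsed directly. The parser below is a minimal sketch that assumes one "name{labels} value" sample per line with no trailing timestamp; the sample text and metric names are illustrative, so check your server's actual /metrics output for the exact names.

```python
def parse_metrics(text, prefix="vllm:"):
    """Parse Prometheus text exposition into {sample_name_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        if name.startswith(prefix):
            metrics[name] = float(value)
    return metrics


# Illustrative sample, shaped like a /metrics response.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
vllm:num_requests_waiting{model_name="demo"} 12.0
process_cpu_seconds_total 41.5
"""

for name, value in parse_metrics(SAMPLE).items():
    print(name, value)
```

A growing waiting count alongside high KV-cache usage is the usual sign that the server needs more memory or more replicas.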
vLLM vs alternatives
- Ollama (see that tutorial): great UX, dev-focused, designed around a single user at a time. vLLM trades UX for throughput.
- llama.cpp / llamafile: CPU-first, GGUF format; runs anywhere but at lower throughput than vLLM on GPUs.
- TGI (Text Generation Inference) by HuggingFace: conceptually similar; vLLM is faster on most benchmarks in 2026.
- SGLang: another serving framework with strong structured-output features; pick based on whether your workload favors throughput (vLLM) or programmable prompting (SGLang).
- TensorRT-LLM (NVIDIA): lower latency than vLLM, but NVIDIA-only and with a much more involved engine-compilation step.
For "I'm building a real product that runs LLMs and care about throughput / cost per token," vLLM is the standard self-hosted serving framework in 2026.