vLLM: production-grade local LLM serving

Install

# Via uv in a Python 3.10+ project (see /tutorials/uv-fast-python-toolchain.html)
uv init vllm-server && cd vllm-server
uv add vllm

# Or pip in a regular venv
python3 -m venv .venv && source .venv/bin/activate
pip install vllm

vLLM targets CUDA (NVIDIA) and ROCm (AMD) GPUs. It can also run on CPU as a fallback, but that is very slow and only suitable for testing. For production, use NVIDIA L40S / A100 / H100 or AMD MI300X; consumer cards like the RTX 4090 / 5090 work for smaller models.

Serve a model with the OpenAI-compatible API

vllm serve mistralai/Mistral-Small-Instruct-2509 \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --tensor-parallel-size 1

That's it. vLLM downloads the model from the Hugging Face Hub, loads the weights, pre-allocates a KV cache from whatever remains of its GPU memory budget (here 90% of VRAM, set by --gpu-memory-utilization), and exposes an OpenAI-compatible API at http://localhost:8000/v1.
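To get a feel for how far that leftover memory goes, the KV cache cost per token can be estimated from the model's shape. The sketch below is back-of-the-envelope arithmetic, not vLLM's own accounting. The layer and head counts are assumed values for an 8B Llama-style model with grouped-query attention, and the 24 GB card and 16 GB of weights are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_budget(gpu_bytes: float, utilization: float, weight_bytes: float,
                    bytes_per_token: int) -> int:
    """Tokens that fit in what is left of the memory budget after the weights.

    Ignores activation memory and framework overhead, so real numbers are lower.
    """
    free = gpu_bytes * utilization - weight_bytes
    return max(0, int(free // bytes_per_token))


# Assumed shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
# Hypothetical 24 GB card at --gpu-memory-utilization 0.90 with 16 GB of weights.
budget = kv_token_budget(24e9, 0.90, 16e9, per_token)
print(per_token, budget)
```

Under these assumptions each token costs 128 KiB and roughly 42,000 tokens fit, which is only a little more than one full 32,768-token context. All concurrent requests share that pool, which is why --max-model-len and --gpu-memory-utilization are worth tuning together.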

Any OpenAI client works:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-Instruct-2509",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

The throughput story

vLLM's killer feature is continuous batching: scheduling happens at the token (iteration) level, not the request level. As soon as request A emits its last token, its batch slot goes to a waiting request on the next decoding step, instead of sitting idle until the whole batch finishes. The second ingredient is PagedAttention: the KV cache is stored in fixed-size pages (analogous to virtual memory) rather than one contiguous region per request, which allows much tighter memory packing.

For a server handling many concurrent users, this translates to:

5-20x higher throughput (tokens/sec aggregate) than serving with HuggingFace Transformers or Ollama.
Predictable P99 latency under load.
Better GPU utilization — an H100 actually busy at 80-90% instead of single-stream Ollama's 30-40%.
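The effect of token-level scheduling can be seen in a toy simulation. This is a sketch of the scheduling idea only, not vLLM's scheduler: it counts decoding steps for a fixed number of batch slots and ignores prefill, memory limits, and preemption. The request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled on the next step."""
    waiting = deque(lengths)
    running: list[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass produces one token for every active request;
        # requests that just produced their last token leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six requests, two batch slots.
lengths = [100, 10, 10, 10, 100, 10]
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

With these lengths, static batching needs 210 steps and continuous batching needs 130, because short requests no longer wait for the long request that shares their batch.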

Multi-GPU: tensor and pipeline parallelism

For models that don't fit on one GPU:

# Shard each layer's weight matrices across 4 GPUs via tensor parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 16384

# Or add pipeline parallelism (different layer ranges on different GPU groups)
vllm serve very-large-model \
    --pipeline-parallel-size 2 \
    --tensor-parallel-size 2

Combined with multi-node deployment (typically tensor-parallel within a node and pipeline-parallel across nodes), vLLM can serve frontier-scale models (200B+ active parameters) on clusters of commodity GPU servers.
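To make the tensor-parallel idea concrete: a linear layer's weight matrix can be cut into column slices, each "GPU" multiplies the same input by its own slice, and the partial outputs are concatenated. The pure-Python sketch below shows only that column-sharding step on plain lists. Real implementations also use row-sharded layers and collective communication between devices, which this example leaves out.

```python
def matmul(x: list[float], w: list[list[float]]) -> list[float]:
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Cut w into `parts` equal column slices, one per simulated GPU."""
    d_out = len(w[0])
    assert d_out % parts == 0, "output dimension must divide evenly"
    width = d_out // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def column_parallel_matmul(x: list[float], w: list[list[float]],
                           parts: int) -> list[float]:
    """Each shard computes its slice of the output; the slices are concatenated."""
    out: list[float] = []
    for shard in split_columns(w, parts):
        out.extend(matmul(x, shard))  # on real hardware these run concurrently
    return out


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(matmul(x, w), column_parallel_matmul(x, w, parts=2))
```

Both calls print the same vector, and each shard only ever holds half of the weights, which is what lets a model too large for one GPU fit across several.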

Quantization for smaller VRAM

# INT8 / FP8 quantization (faster, less VRAM)
vllm serve mistralai/Mistral-Small-Instruct-2509 \
    --quantization fp8

# AWQ / GPTQ pre-quantized models
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
    --quantization awq

# GGUF (in vLLM 0.7+), the format Ollama uses; support is experimental
vllm serve some-org/Llama-3.1-8B-Instruct-GGUF

FP8 on Hopper (H100) gives nearly the same quality as FP16 at half the memory, with faster compute. 4-bit AWQ is the sweet spot for consumer GPUs: it reduces a 70B model's weights from about 140 GB in FP16 to ~40 GB.
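The memory figures above follow from simple arithmetic: parameter count times bits per weight. A small sketch, counting weights only (no KV cache or activations) and using 1 GB = 1e9 bytes:

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params_70b = 70e9
for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gb(params_70b, bits):.0f} GB")
```

The raw 4-bit figure is 35 GB. Quantized checkpoints also carry per-group scaling metadata and usually keep some tensors at higher precision, which is a plausible reason real files land closer to 40 GB.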

Structured output guidance

vLLM supports constrained decoding: you can force the model's output to be valid JSON, to match a regex, or to follow a grammar:

resp = client.chat.completions.create(
    model="...",
    messages=[...],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "extract",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name"]
            }
        }
    },
)

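The model's logits are masked at generation time so that only tokens that keep the output valid under the schema can be sampled. Output that runs to completion is schema-conformant (a response cut off by the token limit can still be incomplete), so there are no "model said it would JSON-format but added prose around it" failures.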
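The masking step can be illustrated with a toy decoder. Everything here is hypothetical: an eight-token vocabulary, a hand-written automaton that accepts exactly {"ok": true} or {"ok": false}, and a fake "model" that returns the same logits at every step. Real engines compile the schema into an automaton over the model's actual tokenizer, but the principle (disallowed tokens get a logit of negative infinity) is the same.

```python
import math

VOCAB = ["{", "}", '"ok"', ":", "true", "false", "Sure!", "<eos>"]

# Hypothetical hand-written automaton: state -> {allowed token: next state}.
FSM = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "end"},
    "end": {"<eos>": "done"},
}


def mask_logits(logits: list[float], state: str) -> list[float]:
    """Set the logit of every token the automaton disallows to -inf."""
    allowed = FSM[state]
    return [logit if tok in allowed else -math.inf
            for tok, logit in zip(VOCAB, logits)]


def constrained_greedy_decode(logits_per_step: list[list[float]]) -> str:
    """Greedy decoding where each step picks the best *allowed* token."""
    state, out = "start", []
    for logits in logits_per_step:
        masked = mask_logits(logits, state)
        tok = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        state = FSM[state][tok]
        if state == "done":
            break
        out.append(tok)
    return "".join(out)


# A stand-in "model" that always prefers the chatty token "Sure!", then "true".
chatty = [0.0, 0.0, 0.0, 0.0, 2.0, 1.0, 5.0, 0.0]
result = constrained_greedy_decode([chatty] * 6)
print(result)
```

Even though "Sure!" has the highest raw logit at every step, it is never allowed by the automaton, so the decoder emits {"ok":true} and nothing else.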

Prefix caching

For workloads where many requests share long prefixes (a chat with a 4 KB system prompt, RAG queries with a fixed instruction header), prefix caching reuses the computed KV cache for the shared portion across requests:

vllm serve ... --enable-prefix-caching

Throughput on workloads with long shared prefixes improves 2-5x, because the shared portion is prefilled once instead of once per request.
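One way to picture prefix reuse is block-level hashing: the prompt is split into fixed-size blocks, and each block's key covers that block plus everything before it, so two prompts share a key only where their prefixes are identical. The sketch below is a simplified illustration of that idea, not vLLM's implementation. The block size of 4 and the token IDs are made up, and only the cache keys are stored, not actual KV tensors.

```python
import hashlib

BLOCK = 4  # tokens per cache block (illustrative; chosen small for the demo)


def block_keys(tokens: list[int]) -> list[str]:
    """One key per full block; each key covers the block and everything before it."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: set[str] = set()

    def process(self, tokens: list[int]) -> int:
        """Return how many prompt tokens were served from cache, then cache the rest."""
        keys = block_keys(tokens)
        hits = 0
        for k in keys:
            if k not in self.blocks:
                break  # the first miss ends the reusable prefix
            hits += 1
        self.blocks.update(keys)
        return hits * BLOCK


system_prompt = list(range(100, 112))  # 12 shared tokens = 3 full blocks
cache = PrefixCache()
first = cache.process(system_prompt + [1, 2, 3, 4, 5])
second = cache.process(system_prompt + [9, 8, 7, 6, 5])
print(first, second)
```

The first request finds nothing cached. The second reuses all 12 system-prompt tokens and only has to compute the part that differs.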

Speculative decoding

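Use a small "draft" model to propose K tokens, then have the big model verify them in parallel. When the draft agrees, you get K tokens for a single forward pass of the big model; at the first disagreement, the big model's own token is used and the remaining draft tokens are discarded: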

# Newer vLLM releases take these settings as JSON via --speculative-config instead
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5

For workloads where the draft model agrees with the big one most of the time (typical English text), end-to-end latency drops by a factor of 1.5-2.
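The draft-then-verify loop can be sketched with two stand-in "models" that are plain functions from context to next token. This is the greedy variant only, under the assumption that both models decode deterministically. With sampling, engines use a probabilistic accept/reject rule instead of an equality check. The function names and the example sentence are hypothetical.

```python
from typing import Callable

NextToken = Callable[[list[str]], str]


def speculative_step(ctx: list[str], draft: NextToken, target: NextToken,
                     k: int) -> list[str]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed, tmp = [], list(ctx)
    for _ in range(k):
        tok = draft(tmp)
        proposed.append(tok)
        tmp.append(tok)
    # 2. The target model checks every position. A real engine scores all k
    #    positions in one batched forward pass; here it is called in a loop.
    accepted: list[str] = []
    for tok in proposed:
        want = target(ctx + accepted)
        if want != tok:
            accepted.append(want)  # first disagreement: take the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(ctx + accepted))  # all matched: one bonus token
    return accepted


SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def target(ctx: list[str]) -> str:
    """Stand-in big model: recites SENTENCE."""
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"


def draft(ctx: list[str]) -> str:
    """Stand-in small model: same as target, but wrong at position 3."""
    return "cat" if len(ctx) == 3 else target(ctx)


print(speculative_step([], draft, target, k=5))
```

Here the draft gets three tokens right and the fourth wrong, so the round yields four tokens: three accepted drafts plus the target's correction. The output always matches what the target model alone would have produced, so the gain is in speed, not in a change of behavior.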

Production deployment

# Systemd unit
[Unit]
Description=vLLM server
Wants=network-online.target
After=network-online.target

[Service]
User=vllm
Group=vllm
Environment=HF_HOME=/var/lib/vllm/huggingface
Environment=VLLM_NO_USAGE_STATS=1
ExecStart=/opt/vllm/bin/vllm serve mistralai/Mistral-Small-Instruct-2509 \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --quantization fp8
Restart=always

[Install]
WantedBy=multi-user.target

Or as a container:

docker run --gpus all \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-Small-Instruct-2509 \
    --port 8000
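Loading a large model can take minutes, so deploy scripts usually need to wait before sending traffic. The helper below is a minimal sketch that polls the server's /health endpoint using only the standard library. The function name, timeout, and polling interval are illustrative choices, not part of vLLM.

```python
import time
import urllib.request


def wait_until_ready(base_url: str, timeout: float = 300.0,
                     interval: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused, HTTP error, or timeout: not up yet
        time.sleep(interval)
    return False


# Example use in a deploy script (assumes the server from above on port 8000):
#     if not wait_until_ready("http://localhost:8000"):
#         raise SystemExit("vLLM did not become healthy in time")
```

Catching OSError covers both connection failures and HTTP error responses, since urllib's error classes derive from it.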

Monitoring

vLLM exposes Prometheus metrics at /metrics:

Request count + latency histograms
Token throughput (prompt vs generated)
GPU memory + KV-cache usage
Queue depth + scheduler stats

Pair with Prometheus + Grafana (see that tutorial) for production observability. The KV-cache utilization graph is the most useful single chart for capacity planning.

vLLM vs alternatives

Ollama (see that tutorial) — great UX, dev-focused, single-user model. vLLM trades UX for throughput.
llama.cpp / llamafile — CPU-first, GGUF format; runs anywhere but at lower throughput than vLLM on GPUs.
TGI (Text Generation Inference) by HuggingFace — conceptually similar; vLLM is faster on most benchmarks in 2026.
SGLang — another serving framework with strong structured-output features; pick based on whether your workload favors throughput (vLLM) or programmable prompting (SGLang).
TensorRT-LLM (NVIDIA) — lower latency than vLLM but NVIDIA-only and much more compilation complexity.

For "I'm building a real product that runs LLMs and care about throughput / cost per token," vLLM is the standard self-hosted serving framework in 2026.