Install via docker compose
# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    restart: unless-stopped
    ports:
      - "127.0.0.1:4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    environment:
      DATABASE_URL: postgres://litellm:${DB_PASSWORD}@postgres:5432/litellm
      LITELLM_MASTER_KEY: ${MASTER_KEY}   # sk-... super-admin key
      LITELLM_SALT_KEY: ${SALT_KEY}       # for encrypting per-team creds
      STORE_MODEL_IN_DB: "True"
      UI_USERNAME: admin
      UI_PASSWORD: ${UI_PASSWORD}
      # Provider credentials that config.yaml reads via os.environ/... must be passed in too
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      GROQ_API_KEY: ${GROQ_API_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on: [ postgres ]
  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
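The compose file reads DB_PASSWORD, MASTER_KEY, SALT_KEY and UI_PASSWORD from the environment, which docker compose can load from a .env file next to docker-compose.yml. A minimal sketch of generating those four values with Python's standard library follows. The function name and the chosen key lengths are assumptions made for illustration, and provider API keys still have to be added by hand.

```python
import secrets


def build_env_lines() -> list[str]:
    """Return KEY=value lines for the secrets referenced in docker-compose.yml."""
    return [
        # Hex only, so the password can be embedded in DATABASE_URL without escaping.
        f"DB_PASSWORD={secrets.token_hex(24)}",
        # The compose comment shows the master key carrying an sk- prefix.
        f"MASTER_KEY=sk-{secrets.token_hex(24)}",
        f"SALT_KEY={secrets.token_hex(32)}",
        f"UI_PASSWORD={secrets.token_hex(16)}",
    ]


if __name__ == "__main__":
    # Print rather than write, so an existing .env is never overwritten by accident.
    print("\n".join(build_env_lines()))
```

Redirect the output into .env once, then keep that file out of version control. Regenerating SALT_KEY later would presumably make credentials that were already encrypted with the old value unreadable, so treat it as write-once.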
config.yaml: declare the models
model_list:
  # OpenAI passthrough
  - model_name: gpt-5
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY
  # Anthropic
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  # Bedrock
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1
  # Local Ollama (see /tutorials/ollama-self-host-llms-linux.html)
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.3:70b
      api_base: http://ollama:11434
  # Mistral via Groq for speed
  - model_name: groq-mixtral
    litellm_params:
      model: groq/mixtral-8x7b-32768
      api_key: os.environ/GROQ_API_KEY
litellm_settings:
  drop_params: true        # drop unsupported per-provider params
  num_retries: 3
  request_timeout: 60
  set_verbose: false
  # Fallback chain: if gpt-5 fails, try claude-opus, then groq-mixtral
  fallbacks:
    - gpt-5: [claude-opus, groq-mixtral]
    - claude-opus: [groq-mixtral]
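  # Budget enforcement
  max_budget: 100          # USD cap for the proxy as a whole; per-key caps are set when issuing keys
  budget_duration: 30d     # reset window for max_budget
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true
  database_connection_pool_limit: 10
  # Alerts to Slack, including budget alerts (needs SLACK_WEBHOOK_URL in the container environment)
  alerting:
    - slack
  alerting_threshold: 300  # seconds; alert when a request hangs or runs longer than this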
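A typo in a fallback entry is easy to miss, because the name only matters when the primary fails. The sketch below checks that every model named in a fallbacks rule is also declared in model_list. It operates on a plain dict standing in for the parsed config.yaml (parsing the file itself would need a YAML library, which is left out here), and the function name is made up for illustration.

```python
def undeclared_fallback_models(config: dict) -> list[str]:
    """Return model names used in fallbacks that are missing from model_list."""
    declared = {entry["model_name"] for entry in config.get("model_list", [])}
    missing: list[str] = []
    for rule in config.get("litellm_settings", {}).get("fallbacks", []):
        # Each rule is a one-entry mapping: primary -> ordered list of fallbacks.
        for primary, targets in rule.items():
            for name in [primary, *targets]:
                if name not in declared and name not in missing:
                    missing.append(name)
    return missing


# Trimmed-down stand-in for the config.yaml above.
config = {
    "model_list": [
        {"model_name": "gpt-5"},
        {"model_name": "claude-opus"},
        {"model_name": "groq-mixtral"},
    ],
    "litellm_settings": {
        "fallbacks": [
            {"gpt-5": ["claude-opus", "groq-mixtral"]},
            {"claude-opus": ["groq-mixtral"]},
        ]
    },
}
print(undeclared_fallback_models(config))  # → []
```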
docker compose up -d
docker compose logs -f litellm
Use it from any OpenAI client
# Python with openai SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-...",  # a per-team virtual key from LiteLLM
)

# Use any of the model_names you declared
resp = client.chat.completions.create(
    model="claude-opus",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
# Same call shape works for "groq-mixtral", "local-llama", etc.
Per-team / per-user API keys
The web UI at http://<host>:4000/ui (log in with the master key as the password, or with the UI_USERNAME/UI_PASSWORD you set) lets you:
- Create teams + add users.
- Issue per-team API keys (sk-...).
- Set per-key model allowlist ("this team can only call cheap models").
- Set per-key budget caps ($50/month; refuses requests when exceeded).
- Set per-key rate limits (100 requests/minute).
- View spend dashboards by team / model / day.
For an organization with many engineers using LLMs, this means one upstream contract per provider (with full API credit), and N internal virtual keys with proper budgeting / observability.
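Keys can also be issued without the UI. The sketch below builds a request for the proxy's key-generation endpoint using only the standard library. The endpoint path and the field names (team_id, models, max_budget, budget_duration, rpm_limit) are assumed from LiteLLM's key management API and should be checked against your installed version; the team ID and helper names are hypothetical.

```python
import json
import urllib.request


def build_key_request(base_url: str, master_key: str, team_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the proxy's /key/generate endpoint."""
    payload = {
        "team_id": team_id,                           # hypothetical; the team must already exist
        "models": ["claude-sonnet", "groq-mixtral"],  # allowlist of model_names from config.yaml
        "max_budget": 50,                             # USD
        "budget_duration": "30d",                     # budget resets every 30 days
        "rpm_limit": 100,                             # requests per minute
    }
    return urllib.request.Request(
        url=f"{base_url}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling send(build_key_request("http://localhost:4000", "sk-...", "team-platform")) against a running proxy should return a JSON object that carries the new virtual key, provided the endpoint behaves as assumed here.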
Cost tracking
LiteLLM tracks token usage per request, applies the provider's per-token rates (built-in cost table for all major providers), and logs to the database. The UI shows:
- Total spend this month vs last month.
- Per-team / per-key / per-model breakdown.
- Daily / hourly heatmap.
- Top-spending users / endpoints.
Export to CSV or hit the API for custom dashboards. Pair with Grafana (see Prometheus tutorial) for org-wide observability.
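The arithmetic behind each logged cost is simple: tokens multiplied by a per-token rate, with input and output priced separately. The sketch below shows the calculation and a per-key rollup. The model name and prices are hypothetical placeholders, not values from LiteLLM's cost table, and the row layout is an assumption made for illustration.

```python
from decimal import Decimal

# Hypothetical prices in USD per one million tokens; real rates come from the provider.
PRICE_PER_MILLION = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost of one request, pricing prompt and completion tokens separately."""
    price = PRICE_PER_MILLION[model]
    total = price["input"] * prompt_tokens + price["output"] * completion_tokens
    return total / Decimal(1_000_000)


def spend_by_key(rows: list[tuple[str, str, int, int]]) -> dict[str, Decimal]:
    """Sum (api_key, model, prompt_tokens, completion_tokens) rows into per-key totals."""
    totals: dict[str, Decimal] = {}
    for api_key, model, prompt_tokens, completion_tokens in rows:
        cost = request_cost(model, prompt_tokens, completion_tokens)
        totals[api_key] = totals.get(api_key, Decimal(0)) + cost
    return totals
```

With these placeholder rates, a request with 1,200 prompt tokens and 300 completion tokens costs (1200 × 3 + 300 × 15) / 1,000,000 = 0.0081 USD. Decimal is used instead of float so that many small amounts sum without rounding drift.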
Routing strategies
router_settings:
  # Strategies: simple-shuffle, least-busy, usage-based-routing, latency-based-routing
  routing_strategy: usage-based-routing
# Per-model deployment list: entries that share a model_name are load-balanced together
model_list:
  - model_name: gpt-4o-equivalent
    litellm_params:
      model: openai/gpt-4o
      tpm: 100000   # rate limit: tokens per minute
      rpm: 1000     # rate limit: requests per minute
  - model_name: gpt-4o-equivalent
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      tpm: 200000
      rpm: 500
Now requests for gpt-4o-equivalent route to whichever of OpenAI or Azure has more capacity right now. Useful for high-throughput apps that need to spread load across multiple provider accounts.
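To make the selection rule concrete, here is a simplified sketch of usage-based routing: skip deployments that would exceed their tokens-per-minute limit, then pick the one with the lowest current usage. This illustrates the idea only and is not LiteLLM's actual router code; the class and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    name: str
    tpm_limit: int      # tokens-per-minute ceiling from the config
    tpm_used: int = 0   # tokens already consumed in the current minute


def pick_deployment(deployments: list[Deployment], needed_tokens: int) -> Optional[Deployment]:
    """Pick the least-used deployment that still has room for this request."""
    with_room = [d for d in deployments if d.tpm_used + needed_tokens <= d.tpm_limit]
    if not with_room:
        return None  # every deployment is saturated; the caller would retry or fall back
    return min(with_room, key=lambda d: d.tpm_used)


pool = [
    Deployment("openai/gpt-4o", tpm_limit=100_000, tpm_used=90_000),
    Deployment("azure/gpt-4o-deployment", tpm_limit=200_000, tpm_used=50_000),
]
print(pick_deployment(pool, needed_tokens=2_000).name)  # → azure/gpt-4o-deployment
```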
Fallback chains
If the primary fails (rate limit, 5xx, timeout), automatically try the fallback list:
fallbacks:
  - gpt-5: [claude-opus, groq-mixtral]
Apps see a successful response; the failover is transparent. Useful when a provider has an outage (which they do; ChatGPT goes down regularly enough to matter).
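The control flow of a fallback chain can be sketched in a few lines: try the requested model, then each fallback in order, and raise only if all of them fail. This is an illustration of the pattern, not the proxy's implementation; flaky_call is a stand-in for a real provider request.

```python
from typing import Callable

# Mirrors the fallbacks entry in config.yaml.
FALLBACKS = {"gpt-5": ["claude-opus", "groq-mixtral"]}


def complete_with_fallbacks(model: str, call: Callable[[str], str]) -> tuple[str, str]:
    """Try `model`, then each fallback in order; return (model_used, response)."""
    last_error = None
    for candidate in [model, *FALLBACKS.get(model, [])]:
        try:
            return candidate, call(candidate)
        except Exception as exc:  # a real proxy would only catch retryable errors
            last_error = exc
    raise RuntimeError(f"all candidates failed for {model}") from last_error


def flaky_call(name: str) -> str:
    # Stand-in for a provider request: pretend the primary is down.
    if name == "gpt-5":
        raise TimeoutError("upstream timeout")
    return f"response from {name}"


print(complete_with_fallbacks("gpt-5", flaky_call))
# → ('claude-opus', 'response from claude-opus')
```

Returning the model that actually answered is a deliberate choice in this sketch: callers that care about cost or output style can see when a fallback was used.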
Caching
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 600   # seconds
Identical requests within the TTL window return cached responses. For repeated agent queries, multi-user systems where many users ask the same FAQ-style question, or test-replaying RAG queries, the cache hit rate can be meaningful.
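"Identical" here means the same model, messages and parameters. The sketch below shows one way such a cache can work: hash a canonical form of the request into a key and store the response with an expiry time. It uses an in-memory dict rather than Redis, and the key derivation is an assumption for illustration, not LiteLLM's actual scheme.

```python
import hashlib
import json
import time


def cache_key(model: str, messages: list[dict], **params) -> str:
    """Hash the request so that identical calls map to the same key."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expires_at, value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


cache = TTLCache(ttl_seconds=600)
key = cache_key("claude-opus", [{"role": "user", "content": "Hello"}], temperature=0)
cache.set(key, "cached answer")
print(cache.get(key))  # → cached answer
```

Because any difference in parameters produces a different key, requests that vary temperature or max_tokens will not share cache entries in this scheme.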
Pair with LibreChat / aider
- LibreChat (see that tutorial): point its OpenAI endpoint at LiteLLM. One LiteLLM = one billing endpoint for the entire team's chat usage.
- aider (see that tutorial): --openai-api-base http://litellm:4000 --openai-api-key sk-team-.... All engineering use of aider tracked centrally.
- n8n (see that tutorial): use the OpenAI node, point at LiteLLM. Workflow LLM costs visible alongside the rest.
Worth knowing
- LiteLLM passes most parameters through. drop_params: true silently drops provider-specific ones that another provider doesn't understand (so the same call works across providers).
- For streaming responses, LiteLLM streams transparently; the proxy adds minimal latency overhead.
- The Python library (pip install litellm) provides the same functionality as a library, without the proxy. Useful for scripts that don't want a separate process.
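The effect of drop_params can be pictured as a filter applied to the request before it is sent upstream. The sketch below is a toy model of that behavior: the provider names and their supported-parameter sets are hypothetical, and the real mapping lives inside LiteLLM.

```python
# Hypothetical capability table; real providers and their parameter sets differ.
SUPPORTED_PARAMS = {
    "provider-a": {"temperature", "max_tokens", "logit_bias"},
    "provider-b": {"temperature", "max_tokens"},
}


def drop_unsupported(provider: str, params: dict) -> dict:
    """Keep only the parameters the target provider understands."""
    allowed = SUPPORTED_PARAMS[provider]
    return {name: value for name, value in params.items() if name in allowed}


request = {"temperature": 0.2, "max_tokens": 256, "logit_bias": {"50256": -100}}
print(drop_unsupported("provider-b", request))
# → {'temperature': 0.2, 'max_tokens': 256}
```

The dropped parameter disappears without an error, which is what keeps one call shape working across providers. It also means a setting you rely on may be silently ignored after a fallback.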