LiteLLM: one OpenAI-shaped endpoint in front of every LLM provider

Install via docker compose

# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    restart: unless-stopped
    ports:
      - "127.0.0.1:4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
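    environment:
      DATABASE_URL: postgres://litellm:${DB_PASSWORD}@postgres:5432/litellm
      LITELLM_MASTER_KEY: ${MASTER_KEY}        # sk-...  super-admin key
      LITELLM_SALT_KEY: ${SALT_KEY}            # for encrypting per-team creds
      STORE_MODEL_IN_DB: "True"
      UI_USERNAME: admin
      UI_PASSWORD: ${UI_PASSWORD}
      # Provider credentials that config.yaml reads via os.environ/...
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      GROQ_API_KEY: ${GROQ_API_KEY}
      SLACK_WEBHOOK_URL: ${SLACK_WEBHOOK_URL}  # used by alerting: [slack]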
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on: [ postgres ]

  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

config.yaml: declare the models

model_list:
  # OpenAI passthrough
  - model_name: gpt-5
    litellm_params:
      model: openai/gpt-5
      api_key: os.environ/OPENAI_API_KEY

  # Anthropic
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  # Bedrock
  - model_name: bedrock-claude
    litellm_params:
      model: bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Local Ollama (see /tutorials/ollama-self-host-llms-linux.html)
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3.3:70b
      api_base: http://ollama:11434

  # Mistral via Groq for speed
  - model_name: groq-mixtral
    litellm_params:
      model: groq/mixtral-8x7b-32768
      api_key: os.environ/GROQ_API_KEY

litellm_settings:
  drop_params: true                        # drop unsupported per-provider params
  num_retries: 3
  request_timeout: 60
  set_verbose: false

  # Fallback chain: if gpt-5 fails, try claude-opus, then groq-mixtral
  fallbacks:
    - gpt-5: [claude-opus, groq-mixtral]
    - claude-opus: [groq-mixtral]

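  # Budget enforcement (proxy-wide; per-key budgets are set on each virtual key)
  max_budget: 100                          # USD across the whole proxy
  budget_duration: 30d                     # reset the budget every 30 days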

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true
  database_connection_pool_limit: 10

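  # Slack alerts (budget and slow/hanging requests); needs SLACK_WEBHOOK_URL in the environment
  alerting:
    - slack
  alerting_threshold: 300                  # seconds; alert when a request hangs or runs longer than this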

docker compose up -d
docker compose logs -f litellm
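
Before pointing clients at the proxy, it helps to wait until it reports ready. The sketch below polls a readiness endpoint with the standard library only. The path /health/readiness and the URL http://localhost:4000 are assumptions based on the compose file above; the helper names probe_ready and wait_until_ready are made up for this example.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health/readiness answers with HTTP 200.

    The endpoint path is an assumption; adjust it if your LiteLLM version differs.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health/readiness", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0, sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage once the stack is running (not executed here):
# ok = wait_until_ready(lambda: probe_ready("http://localhost:4000"))
```

Passing the probe in as a function keeps the retry loop testable without a running container.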

Use it from any OpenAI client

# Python with openai SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-...",          # a per-team virtual key from LiteLLM
)

# Use any of the model_names you declared
resp = client.chat.completions.create(
    model="claude-opus",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

# Same call shape works for "groq-mixtral", "local-llama", etc.

Per-team / per-user API keys

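The web UI at http://<host>:4000/ui (log in with UI_USERNAME/UI_PASSWORD if set; otherwise username admin with the master key as the password) lets you do the following. Note that the compose file above binds the port to 127.0.0.1, so reach it from the host itself or through a reverse proxy: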

Create teams + add users.
Issue per-team API keys (sk-...).
Set per-key model allowlist ("this team can only call cheap models").
Set per-key budget caps ($50/month; refuses requests when exceeded).
Set per-key rate limits (100 requests/minute).
View spend dashboards by team / model / day.

For an organization with many engineers using LLMs, this means one upstream contract per provider (with full API credit) and N internal virtual keys, each with its own budget, rate limit, and usage visibility.
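
The same key settings can also be scripted instead of clicked. The sketch below builds a POST request to the proxy's /key/generate endpoint using only the standard library. The field names (team_id, models, max_budget, budget_duration, rpm_limit) follow the key-generation API as I understand it, so check them against your LiteLLM version. The team id "team-data" and the master key placeholder are hypothetical, and the team is assumed to exist already.

```python
import json
import urllib.request


def build_key_request(base_url, master_key, team_id, models, max_budget,
                      budget_duration="30d", rpm_limit=None):
    """Build (but do not send) a POST /key/generate request for a virtual key."""
    payload = {
        "team_id": team_id,                  # hypothetical team, must already exist
        "models": models,                    # allowlist of model_name values from config.yaml
        "max_budget": max_budget,            # USD cap for this key
        "budget_duration": budget_duration,  # how often the budget resets
    }
    if rpm_limit is not None:
        payload["rpm_limit"] = rpm_limit     # requests per minute for this key
    return urllib.request.Request(
        url=f"{base_url}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_key(request):
    """Send the request and return the new key string from the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)["key"]


# Build the request only; call create_key(req) against a running proxy to send it.
req = build_key_request(
    "http://localhost:4000", "sk-master-placeholder",
    team_id="team-data", models=["groq-mixtral", "local-llama"],
    max_budget=50, rpm_limit=100,
)
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Splitting "build" from "send" makes it easy to review the payload before anything reaches the proxy.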

Cost tracking

LiteLLM tracks token usage per request, applies the provider's per-token rates (built-in cost table for all major providers), and logs to the database. The UI shows:

Total spend this month vs last month.
Per-team / per-key / per-model breakdown.
Daily / hourly heatmap.
Top-spending users / endpoints.

Export to CSV or hit the API for custom dashboards. Pair with Grafana (see Prometheus tutorial) for org-wide observability.
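
The per-request arithmetic behind those dashboards is simple: tokens times the per-token rate, summed per team. The sketch below illustrates it with made-up rates and usage records. The prices, team names, and record layout are hypothetical and are not LiteLLM's cost table or database schema.

```python
from collections import defaultdict


def request_cost_usd(prompt_tokens, completion_tokens,
                     input_usd_per_million, output_usd_per_million):
    """Cost of one request: input and output tokens are priced separately."""
    input_cost = prompt_tokens * input_usd_per_million / 1_000_000
    output_cost = completion_tokens * output_usd_per_million / 1_000_000
    return input_cost + output_cost


def spend_by_team(records, rates):
    """Sum request costs per team. `rates` maps model -> (input, output) USD per million tokens."""
    totals = defaultdict(float)
    for rec in records:
        input_rate, output_rate = rates[rec["model"]]
        totals[rec["team"]] += request_cost_usd(
            rec["prompt_tokens"], rec["completion_tokens"], input_rate, output_rate
        )
    return dict(totals)


# Hypothetical rates and usage, for illustration only.
rates = {"model-a": (3.00, 15.00), "model-b": (0.50, 1.50)}
records = [
    {"team": "search", "model": "model-a", "prompt_tokens": 1200, "completion_tokens": 300},
    {"team": "search", "model": "model-b", "prompt_tokens": 4000, "completion_tokens": 1000},
    {"team": "support", "model": "model-b", "prompt_tokens": 2000, "completion_tokens": 2000},
]

single = request_cost_usd(1200, 300, 3.00, 15.00)
print(f"${single:.4f}")  # prints $0.0081
print(spend_by_team(records, rates))
```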

Routing strategies

router_settings:
  # Strategies: simple-shuffle, least-busy, usage-based-routing, latency-based-routing
  routing_strategy: usage-based-routing

# Top-level model_list: two deployments share one model_name,
# and the router balances between them
model_list:
  - model_name: gpt-4o-equivalent
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      tpm: 100000        # tokens-per-minute limit for this deployment
      rpm: 1000          # requests-per-minute limit

  - model_name: gpt-4o-equivalent
    litellm_params:
      model: azure/gpt-4o-deployment
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      tpm: 200000
      rpm: 500

Requests for gpt-4o-equivalent now go to whichever deployment (OpenAI or Azure) has the lowest current token usage while still under its tpm/rpm limits. This helps high-throughput apps spread load across multiple provider accounts.
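
To make the selection rule concrete, here is a minimal sketch of usage-based selection: skip deployments that would exceed their limits, then take the one with the lowest tokens used in the current minute. It illustrates the idea only and is not LiteLLM's router code. The Deployment class, pick_deployment, and the usage numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    tpm_limit: int      # tokens per minute allowed
    rpm_limit: int      # requests per minute allowed
    used_tpm: int = 0   # tokens consumed in the current minute
    used_rpm: int = 0   # requests sent in the current minute


def pick_deployment(deployments, request_tokens):
    """Return the eligible deployment with the lowest current token usage, or None."""
    eligible = [
        d for d in deployments
        if d.used_tpm + request_tokens <= d.tpm_limit and d.used_rpm + 1 <= d.rpm_limit
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda d: d.used_tpm)


openai_dep = Deployment("openai/gpt-4o", tpm_limit=100_000, rpm_limit=1000,
                        used_tpm=60_000, used_rpm=200)
azure_dep = Deployment("azure/gpt-4o-deployment", tpm_limit=200_000, rpm_limit=500,
                       used_tpm=40_000, used_rpm=499)

first = pick_deployment([openai_dep, azure_dep], request_tokens=2_000)
print(first.name)  # prints azure/gpt-4o-deployment

azure_dep.used_rpm = 500  # Azure has now hit its requests-per-minute limit
second = pick_deployment([openai_dep, azure_dep], request_tokens=2_000)
print(second.name)  # prints openai/gpt-4o
```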

Fallback chains

If the primary fails (rate limit, 5xx, timeout), automatically try the fallback list:

litellm_settings:
  fallbacks:
    - gpt-5: [claude-opus, groq-mixtral]

Apps see a successful response, so the failover is transparent to them. This matters during provider outages, which happen often enough to plan for.
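
The behavior can be pictured as a loop over the primary model and its fallback list, returning the first success. This is a sketch of the idea, not LiteLLM's implementation. The function call_with_fallbacks and the fake sender are hypothetical, and the fallback map is written as a plain dict rather than the list-of-mappings form used in config.yaml.

```python
def call_with_fallbacks(model, fallbacks, send):
    """Try `model`, then each fallback in order; return (model_used, response)."""
    errors = []
    for candidate in [model, *fallbacks.get(model, [])]:
        try:
            return candidate, send(candidate)
        except Exception as exc:  # rate limit, 5xx, timeout, ...
            errors.append((candidate, exc))
    failed = [name for name, _ in errors]
    raise RuntimeError(f"all models failed: {failed}")


fallbacks = {"gpt-5": ["claude-opus", "groq-mixtral"], "claude-opus": ["groq-mixtral"]}


def fake_send(model_name):
    """Stand-in for a provider call: the primary times out, everything else answers."""
    if model_name == "gpt-5":
        raise TimeoutError("simulated provider timeout")
    return f"response from {model_name}"


used, response = call_with_fallbacks("gpt-5", fallbacks, fake_send)
print(used, "->", response)  # prints claude-opus -> response from claude-opus
```

In a real deployment the caller usually only sees the response; returning the model name here just makes the failover visible.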

Caching

litellm_settings:
  cache: true
  cache_params:
    type: redis
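    host: redis       # assumes a Redis service named "redis" on the same compose network (not in the compose file above)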
    port: 6379
    ttl: 600          # seconds

Identical requests within the TTL window return the cached response without calling the provider. Hit rates tend to be highest for repeated agent queries, multi-user systems where many users ask the same FAQ-style question, and replayed RAG test queries.
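
"Identical" here means the same model, messages, and parameters. A minimal in-memory sketch of that idea: hash a canonical form of the request into a key and store the response with an expiry. This is an illustration under my own assumptions, not LiteLLM's key derivation or its Redis layout. The names cache_key and TTLCache are hypothetical.

```python
import hashlib
import json
import time


def cache_key(model, messages, **params):
    """Hash a canonical JSON form of the request so equal requests map to equal keys."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


messages = [{"role": "user", "content": "What are your opening hours?"}]
key_a = cache_key("claude-opus", messages, temperature=0)
key_b = cache_key("claude-opus", messages, temperature=0)
key_c = cache_key("claude-opus", messages, temperature=1)
print(key_a == key_b, key_a == key_c)  # prints True False
```

Note that any change to a parameter such as temperature produces a different key, so it counts as a different request.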

Pair with LibreChat / aider

LibreChat (see that tutorial) — point its OpenAI endpoint at LiteLLM. One LiteLLM = one billing endpoint for the entire team's chat usage.
aider (see that tutorial) — --openai-api-base http://litellm:4000 --openai-api-key sk-team-.... All engineering use of aider tracked centrally.
n8n (see that tutorial) — use the OpenAI node, point at LiteLLM. Workflow LLM costs visible alongside the rest.

Worth knowing

LiteLLM passes most parameters through. With drop_params: true, it silently drops OpenAI-style parameters that the target provider doesn't support, so the same call works across providers.
For streaming responses, LiteLLM relays chunks as they arrive, so the proxy adds little latency.
The Python library (pip install litellm) exposes the same unified call interface in-process, without the proxy. Useful for scripts that don't want a separate process; virtual keys, the UI, and spend dashboards remain proxy features.
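
On the streaming point above: assuming the proxy emits OpenAI-style server-sent events (lines of the form "data: {json}" ending with "data: [DONE]"), a client can reassemble the text as sketched below. The sample lines are hand-written for illustration, not captured from a real response, and iter_stream_text is a hypothetical helper.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines until the [DONE] marker."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample in the assumed wire format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]

print("".join(iter_stream_text(sample)))  # prints Hello
```

With the openai SDK you would normally pass stream=True and iterate the returned chunks instead of parsing lines by hand; the sketch only shows what travels over the wire.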