Grafana Tempo: distributed traces backend

Install (single-binary mode)

# docker-compose.yml
services:
  tempo:
    image: grafana/tempo:latest
    container_name: tempo
    restart: unless-stopped
    command: [ "-config.file=/etc/tempo/tempo.yml" ]
    ports:
      - "127.0.0.1:3200:3200"     # tempo HTTP query
      - "127.0.0.1:4317:4317"     # OTLP gRPC
      - "127.0.0.1:4318:4318"     # OTLP HTTP
    volumes:
      - ./tempo.yml:/etc/tempo/tempo.yml:ro
      - tempo-data:/var/tempo
volumes:
  tempo-data:

tempo.yml (minimal monolithic-mode config):

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268
    zipkin:
      endpoint: 0.0.0.0:9411

ingester:
  trace_idle_period: 10s
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 720h        # 30 days

storage:
  trace:
    backend: local                # or s3 / gcs / azure
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

querier:
  search:
    query_timeout: 30s

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write     # service graphs & span metrics; Prometheus must run with --web.enable-remote-write-receiver

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

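Bring it up: docker compose up -d. Tempo accepts OTLP on 4317 (gRPC) and 4318 (HTTP), so applications instrumented with OpenTelemetry SDKs can send spans directly. Because the compose file publishes these ports on 127.0.0.1 only, senders must run on the same host or share Tempo's Docker network. Alternatively, put an OpenTelemetry Collector in front of Tempo (see that tutorial) for filtering, sampling, and attribute manipulation.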
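A quick smoke test helps confirm that the stack really accepts spans. The sketch below is an illustration, not part of Tempo: it polls Tempo's /ready endpoint and posts one hand-built span as OTLP/HTTP JSON to /v1/traces. The helper names and the service name smoke-test are made up for the example, and the default URLs assume the port mappings from the compose file above.

```python
import json
import secrets
import time
import urllib.error
import urllib.request


def is_ready(base_url="http://127.0.0.1:3200", timeout=3.0):
    """Return True if Tempo's /ready endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer such as 503.
        return False


def build_test_span(service_name="smoke-test", span_name="hello-tempo"):
    """Build a one-span OTLP/JSON payload. IDs are hex strings in OTLP JSON."""
    end_ns = time.time_ns()
    start_ns = end_ns - 50_000_000  # pretend the span took 50 ms
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 bytes -> 32 hex chars
                    "spanId": secrets.token_hex(8),    # 8 bytes -> 16 hex chars
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


def send_span(payload, otlp_http="http://127.0.0.1:4318"):
    """POST the payload to the OTLP/HTTP traces endpoint; return the HTTP status."""
    req = urllib.request.Request(
        otlp_http.rstrip("/") + "/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Typical use: wait until is_ready() returns True, call send_span(build_test_span()), then look up the generated traceId in Grafana. A freshly started Tempo can report not-ready for a short while, so a few retries are reasonable.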

Object-storage backend (production)

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.example.com
      access_key: ${S3_KEY}       # ${VAR} is expanded only when Tempo runs with -config.expand-env=true
      secret_key: ${S3_SECRET}
      insecure: false
    wal:
      path: /var/tempo/wal
    pool:
      max_workers: 100

MinIO (see that tutorial), Backblaze B2, AWS S3, and Cloudflare R2 all use this same shape with a different endpoint. Object storage for trace data typically costs pennies per GB-month, and because Tempo needs no separate index cluster, its storage cost per trace is far lower than Jaeger-on-Elasticsearch.

Wire Grafana

In Grafana → Data Sources → Add → Tempo, set the URL to http://tempo:3200. In Explore, select the Tempo data source and you can:

Look up by trace ID
Run TraceQL queries
See service graphs (auto-generated from the metrics-generator)
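The same data source can be created without clicking through the UI. The sketch below assumes Grafana's HTTP API (POST /api/datasources) and a service account token with permission to manage data sources; the function names, the Grafana URL, and the token are placeholders for illustration.

```python
import json
import urllib.request


def tempo_datasource_payload(name="Tempo", url="http://tempo:3200"):
    """Minimal data source definition: Grafana proxies queries to Tempo."""
    return {"name": name, "type": "tempo", "access": "proxy", "url": url}


def create_datasource(grafana_url, api_token, payload):
    """POST the definition to Grafana and return the decoded JSON response."""
    req = urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, create_datasource("http://grafana:3000", token, tempo_datasource_payload()) would register Tempo under the name "Tempo". Grafana's file-based provisioning is the other common route if you prefer config over API calls.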

TraceQL: actually queryable traces

# Spans from the payment service that took longer than 1 second
{ resource.service.name = "payment-service" && duration > 1s }

# Traces that hit /api/checkout AND had an error
{ name = "POST /api/checkout" && status = error }

# Spans with a specific HTTP status code
{ span.http.status_code = 500 }

# Traces involving these services in this order
{ resource.service.name = "frontend" } >> { resource.service.name = "api" } >> { resource.service.name = "db" }

# Aggregations: group matching spans by status code, keep groups whose average duration exceeds 500ms
{ name = "GET /products" } | by(span.http.status_code) | avg(duration) > 500ms

TraceQL replaces the "find a trace ID, hope it's the right one" debugging story with proper span-attribute search.
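TraceQL is also usable outside Grafana, because Tempo exposes a search endpoint on its HTTP port. The sketch below assumes the /api/search endpoint with its q, limit, start, and end parameters (start and end in Unix seconds); the function names and the one-hour default lookback are choices made for this example.

```python
import json
import time
import urllib.parse
import urllib.request


def search_url(base_url, traceql, limit=20, lookback_s=3600, now=None):
    """Build a Tempo search URL for a TraceQL query over a recent time window."""
    end = int(now if now is not None else time.time())
    params = {"q": traceql, "limit": limit, "start": end - lookback_s, "end": end}
    return base_url.rstrip("/") + "/api/search?" + urllib.parse.urlencode(params)


def search(base_url, traceql, **kwargs):
    """Run the query and return the list of matching trace summaries."""
    with urllib.request.urlopen(search_url(base_url, traceql, **kwargs), timeout=30) as resp:
        return json.load(resp).get("traces", [])
```

Something like search("http://127.0.0.1:3200", "{ status = error }") should return summaries of recent traces containing an error span, which is handy for scripted checks or alert enrichment.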

Span metrics + service graph (the killer add-on)

With metrics_generator enabled (above), Tempo auto-generates two streams of Prometheus metrics from incoming spans:

Span metrics — per-service / per-operation RED metrics (rate, error rate, duration histogram). Free observability for any traced service without manually adding instrumentation.
Service graphs — pairs of services that talked to each other, with edge weights for traffic + error rate. Auto-discovered architecture diagram.

Send the generated metrics to Prometheus / Mimir / VictoriaMetrics (see that tutorial) via remote_write; query alongside your other metrics.
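Once the generated metrics land in Prometheus, RED dashboards are a few PromQL expressions away. The sketch below builds those expressions in Python. It assumes the metrics-generator's default metric names (traces_spanmetrics_calls_total, traces_spanmetrics_latency_bucket) and the service and status_code labels; check the names in your own Prometheus before relying on them, since they can differ across Tempo versions and configs.

```python
import urllib.parse


def red_queries(service, window="5m"):
    """Return PromQL strings for rate, error rate, and p95 latency of a service."""
    sel = f'service="{service}"'
    return {
        "rate": f"sum(rate(traces_spanmetrics_calls_total{{{sel}}}[{window}]))",
        "errors": (
            "sum(rate(traces_spanmetrics_calls_total"
            f'{{{sel},status_code="STATUS_CODE_ERROR"}}[{window}]))'
        ),
        "p95": (
            "histogram_quantile(0.95, sum by (le) "
            f"(rate(traces_spanmetrics_latency_bucket{{{sel}}}[{window}])))"
        ),
    }


def prometheus_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
```

Each string can be pasted into a Grafana panel as is, or fetched through prometheus_query_url("http://prometheus:9090", ...) from a script.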

Instrument an app (OpenTelemetry SDK)

# Python example
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install   # installs instrumentation packages for the libraries it detects

# Run the app with auto-instrumentation
OTEL_SERVICE_NAME=payment-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python main.py

This configuration samples 10% of new traces (the SDK default, parentbased_always_on, keeps everything); tune the ratio to your traffic volume. Auto-instrumentation captures HTTP server and client calls, database clients, Redis, and more with zero code changes. For custom spans within app code, add OpenTelemetry SDK calls.
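It helps to see what parentbased_traceidratio with an argument of 0.1 actually decides. The sketch below is a simplified model of the idea, not the SDK's own code: a root span is kept when the low 64 bits of its trace ID fall under a threshold proportional to the ratio, and a child span simply follows its parent's decision. The function names are made up for the example, and real SDKs differ in details.

```python
import random

TRACE_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the 128-bit trace ID


def ratio_sampled(trace_id, ratio):
    """Keep the trace if its low 64 bits fall below ratio * 2^64."""
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


def parent_based(trace_id, ratio, parent_sampled=None):
    """Follow the parent's decision if there is one; otherwise apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id, ratio)


def observed_ratio(ratio, n=100_000, seed=0):
    """Fraction of n random root trace IDs that would be kept."""
    rng = random.Random(seed)
    kept = sum(parent_based(rng.getrandbits(128), ratio) for _ in range(n))
    return kept / n
```

Because the decision is a pure function of the trace ID, every service that sees the same trace reaches the same verdict, so traces are kept or dropped whole rather than in fragments.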

Tail-based sampling

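Random sampling at the app drops most of the interesting outliers (the slow request, the error). Tail-based sampling fixes this: the app sends every span to a collector (so ratio sampling in the SDK must be switched off), and the collector decides which traces to keep after it has seen all their spans: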

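# OpenTelemetry Collector config feeding Tempo (tail_sampling ships in the contrib distribution)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: sample-1pct, type: probabilistic, probabilistic: { sampling_percentage: 1 } }

exporters:
  otlp:
    endpoint: tempo:4317
    tls: { insecure: true }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]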

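This keeps every trace that contains an error, every trace slower than 1 second, and a 1% random baseline. Every interesting failure is captured while storage cost drops. Two caveats: the decision is made 10 seconds (decision_wait) after a trace's first span arrives, and all spans of a trace must reach the same collector instance.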
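The policy list is easier to reason about as a decision function. The sketch below is a simplified stand-in for what the collector does, not its actual implementation: a trace is kept if any one policy matches. The Span type, the function name, and the hash-based baseline (used here instead of a random draw so the result is repeatable) are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    start_ms: int
    end_ms: int
    status: str = "OK"  # "OK", "ERROR" or "UNSET"


def keep_trace(spans, latency_threshold_ms=1000, baseline_pct=1.0):
    """Return the name of the first matching policy, or None to drop the trace."""
    if not spans:
        return None
    # Policy 1: any span with an error status keeps the whole trace.
    if any(s.status == "ERROR" for s in spans):
        return "errors"
    # Policy 2: trace duration, measured from earliest start to latest end.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > latency_threshold_ms:
        return "slow"
    # Policy 3: baseline percentage, derived from a hash of the trace ID.
    digest = hashlib.sha256(spans[0].trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    if bucket < baseline_pct * 100:
        return "baseline"
    return None
```

The important property is that the function sees the complete list of spans before deciding, which is exactly what head sampling in the SDK cannot do.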

Tempo's architecture (cluster mode)

For high-throughput production, run Tempo in microservices mode:

distributor — receives spans, routes to ingesters
ingester — buffers in memory, periodically flushes blocks to object storage
querier — serves trace lookups + TraceQL queries
compactor — merges small blocks in object storage
query-frontend — splits large queries across queriers
metrics-generator — computes span metrics + service graphs

Each component scales independently. A Helm chart (tempo-distributed) is available for Kubernetes deployments.

Tempo vs alternatives

Jaeger — the elder; the canonical OpenTelemetry-compat traces backend. Cassandra / Elasticsearch backends are operationally heavy.
Zipkin — older still; simpler.
SigNoz — ClickHouse-backed observability platform (metrics + logs + traces from one DB). Compelling if you want one tool for all three.
Honeycomb / Datadog APM / Lightstep / NewRelic — commercial SaaS. Better UX for analysis; per-event pricing.
OpenSearch / Elasticsearch — Jaeger's traditional backend.

For "I run Grafana, I want traces, I want them stored cheaply on object storage I already have," Tempo is the right pick in 2026.

Worth knowing

OpenTelemetry is the standard. Tempo speaks OTLP natively; Jaeger / Zipkin are supported for compatibility but new instrumentation should use OTel.
Traces are expensive without sampling. A service handling 100 req/sec and tracing every request produces about 8.6 million traces per day, and several times that in spans. Sample aggressively; keep errors / outliers.
Loki / Tempo / Mimir share the same Grafana-native object-storage architecture. Operating all three together is mostly the same skill set.
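The volume claim above is easy to check with arithmetic. The sketch below assumes one trace per request; the spans-per-trace count and the 500-byte average span size are illustrative guesses, not measurements, so substitute numbers from your own services.

```python
SECONDS_PER_DAY = 86_400


def daily_volume(req_per_sec, spans_per_trace, sample_ratio=1.0, bytes_per_span=500):
    """Estimate traces, spans, and raw gigabytes produced per day."""
    traces = round(req_per_sec * SECONDS_PER_DAY * sample_ratio)
    spans = traces * spans_per_trace
    return {
        "traces": traces,
        "spans": spans,
        "gigabytes": spans * bytes_per_span / 1e9,
    }
```

With 100 req/sec and an assumed 8 spans per trace, daily_volume(100, 8) gives 8,640,000 traces and 69,120,000 spans per day, roughly 34.6 GB at 500 bytes per span. Dropping to a 0.1 sample ratio cuts that to 864,000 traces and 6,912,000 spans.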