Dagster: asset-graph thinking for data pipelines

Install

# Create a Python virtualenv (or use uv — see /tutorials/uv-fast-python-toolchain.html)
uv init my-pipelines && cd my-pipelines

uv add dagster dagster-webserver dagster-postgres \
       pandas pyarrow duckdb

# Or with pip
pip install dagster dagster-webserver dagster-postgres pandas pyarrow duckdb

The first asset

# defs.py
import pandas as pd
import dagster as dg

@dg.asset
def raw_orders() -> pd.DataFrame:
    """Fetch all orders from the source system."""
    return pd.read_csv("https://example.com/orders.csv")

@dg.asset
def orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    """Clean + enrich orders."""
    return raw_orders.assign(
        total_with_tax=raw_orders["total"] * 1.13,
        ts=pd.to_datetime(raw_orders["created_at"]),
    )

@dg.asset
def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
    """Daily revenue summary."""
    return orders.groupby(orders["ts"].dt.date)["total_with_tax"].sum().reset_index()

defs = dg.Definitions(assets=[raw_orders, orders, daily_revenue])

That's the whole pipeline. orders depends on raw_orders because its function takes it as a typed parameter; Dagster reads that to build the dependency graph. daily_revenue depends on orders. No "DAG of tasks" boilerplate.

Run the dev environment

dg dev

The web UI opens at http://localhost:3000. The Assets tab shows the graph:

raw_orders → orders → daily_revenue

Click any asset to materialize it (run the code that produces it). Click "Materialize all" to run the whole graph in dependency order. The UI shows live logs, the schema of the output (when Pandas is involved), and per-asset metadata.

I/O managers: where data lives

By default, Dagster pickles your assets to local disk. For production, configure an I/O manager that writes to S3, Postgres, Snowflake, BigQuery, etc.:

from dagster_aws.s3 import s3_pickle_io_manager
from dagster_duckdb_pandas import duckdb_pandas_io_manager

defs = dg.Definitions(
    assets=[...],
    resources={
        "io_manager": duckdb_pandas_io_manager.configured({
            "database": "data/warehouse.duckdb",
        })
    }
)

Now every asset's output is a DuckDB table (see DuckDB tutorial); downstream assets read it back as a Pandas DataFrame automatically. Swap I/O manager → data moves to S3 / Snowflake / etc., zero code changes in assets.

Partitioned assets

For time-series data:

@dg.asset(
    partitions_def=dg.DailyPartitionsDefinition(start_date="2024-01-01"),
)
def daily_events(context: dg.AssetExecutionContext) -> pd.DataFrame:
    day = context.partition_key       # "2024-05-22"
    return pd.read_parquet(f"s3://events/year=2024/month=05/day={day[-2:]}.parquet")

Now Dagster knows the asset has per-day partitions. The UI shows a per-day status grid; dagster asset backfill can re-run a date range; new partitions can be auto-materialized on a schedule.

Schedules & sensors

# Schedule: re-materialize daily at 03:15
@dg.schedule(cron_schedule="15 3 * * *", target=dg.AssetSelection.all())
def nightly_refresh(): pass

# Sensor: check an external system, trigger materialization when something changes
@dg.sensor(target=dg.AssetSelection.assets("daily_events"))
def new_file_sensor(context: dg.SensorEvaluationContext):
    new_keys = list_new_s3_keys_since(context.cursor)
    for key in new_keys:
        yield dg.RunRequest(run_key=key, partition_key=key)
    context.update_cursor(latest_key)

Schedules and sensors run inside the daemon process; sensors poll on a tick interval (default 30s).

Automation conditions: the modern Dagster way

Instead of writing manual schedules + sensors, declare when each asset should be materialized on the asset itself:

@dg.asset(
    automation_condition=dg.AutomationCondition.eager(),
)
def fresh_view(upstream: pd.DataFrame) -> pd.DataFrame:
    return upstream.rolling(7).mean()

@dg.asset(
    automation_condition=dg.AutomationCondition.on_cron("@daily"),
)
def daily_report(...) -> ...:
    ...

Dagster's automation engine evaluates these conditions and triggers runs automatically. Combine: "re-materialize this asset eagerly when any upstream changes, but only between 9am and 5pm on weekdays."

Production deployment

Dagster has three processes:

webserver (dagster-webserver) — the UI + GraphQL API
daemon — runs schedules + sensors + auto-materialization + automation conditions
user code servers — loaded gRPC servers that hold your pipeline definitions

For production:

Postgres as the metadata store (run history, asset materialization log, schedule state).
Dagster Helm chart for Kubernetes deployments.
Or dagster-postgres + systemd for VM-based deploys.
For run execution, choose an executor: in-process (dev), multi-process (single host), k8s-job (one pod per run), Celery, Dask.

dbt + Dagster: the canonical pairing

Dagster has first-class dbt support — every dbt model becomes a Dagster asset, with the dbt-defined lineage. Run dbt and downstream Python in one DAG, with one UI showing both:

from dagster_dbt import dbt_assets, DbtCliResource

dbt_resource = DbtCliResource(project_dir="./dbt_project")

@dbt_assets(manifest=dbt_resource.cli(["parse"], target_path=Path("target")).target_path.joinpath("manifest.json"))
def my_dbt_assets(context, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

The UI then shows your dbt model lineage as part of the asset graph, with materialization logs from dbt run integrated.

Dagster vs Airflow vs Prefect

Airflow — the elder. Task-graph mental model; massive ecosystem; battle-tested at huge scale. Operationally heavier; the asset-lineage story is bolted on later.
Prefect — flow-based Python; lighter than Airflow; cleaner Python API. Less focus on the data lineage / asset-as-product framing than Dagster.
Dagster — asset-first mental model; best fit when "what data products do we have, are they fresh, what produces them" is the actual question; integrates dbt + Python + SQL transparently.

When Dagster isn't the right tool

For workflows where the unit of work isn't a data product (job orchestration, ETL into a black box, scheduled batch reports without lineage), Airflow's task-graph fits better.
For ML training pipelines with elaborate experiment tracking, MLflow / Metaflow are more specialized.
For very simple "run this Python script daily" jobs, a systemd timer (see that tutorial) is simpler.

For "we have a data warehouse + dashboards + ML models and need to know what's fresh, who consumes what, and when to rebuild," Dagster is the orchestrator with the right primitives in 2026.