Apache Airflow: data pipeline orchestration via Python DAGs

Install via docker compose (the official setup)

# Get the official compose file
curl -LO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

# Critical: set AIRFLOW_UID for proper file permissions
mkdir -p ./dags ./logs ./plugins ./config
echo -e "AIRFLOW_UID=$(id -u)" > .env

# Initialize the database
docker compose up airflow-init

# Start the full stack: scheduler + webserver + triggerer + worker + postgres + redis
docker compose up -d

# Web UI at http://localhost:8080
# Default login: airflow / airflow (change immediately)

The architecture

Scheduler — watches DAG files, decides which tasks to run, queues them.
Worker — pulls queued tasks + executes them. Can be Celery workers (queue-based) or Kubernetes (one pod per task).
Webserver — the UI; shows DAGs, runs, logs, configuration.
Triggerer — handles deferred tasks (sensors that wait for events).
Metadata DB (Postgres) — state of every DAG, task instance, variable, connection.

The first DAG

# dags/etl_pipeline.py
from airflow import DAG
from airflow.decorators import task
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.providers.slack.notifications.slack_webhook import SlackWebhookNotifier
from datetime import datetime, timedelta

default_args = {
    "owner": "data-team",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_etl",
    description="Daily extract + transform + load",
    schedule="0 3 * * *",                   # 3 AM every day
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args=default_args,
    tags=["etl", "production"],
    on_failure_callback=[SlackWebhookNotifier(text="ETL failed: {{ run_id }}")],
) as dag:

    @task
    def extract():
        import requests
        data = requests.get("https://api.example.com/users").json()
        return [u["id"] for u in data if u["active"]]

    @task
    def transform(user_ids: list):
        # ... process the IDs
        return {"count": len(user_ids), "ids": user_ids}

    load = PostgresOperator(
        task_id="load",
        postgres_conn_id="warehouse",
        sql="INSERT INTO daily_users (run_date, active_count) VALUES ('{{ ds }}', {{ ti.xcom_pull(task_ids='transform')['count'] }})",
    )

    ids = extract()
    processed = transform(ids)
    processed >> load

Drop this file in ./dags/; Airflow picks it up within ~30 seconds. The UI shows it; click "Run" to trigger manually or let the schedule fire.

The TaskFlow API

Modern Airflow (2.0+) supports the @task decorator (above) which is much cleaner than the old PythonOperator pattern. Functions become tasks; return values flow through XCom automatically.

@task
def fetch_orders():
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}]

@task
def total_amount(orders: list):
    return sum(o["amount"] for o in orders)

@task
def write_total(total: int):
    print(f"Total: {total}")

# Linear pipeline
orders = fetch_orders()
total = total_amount(orders)
write_total(total)

# Or fan-out / fan-in
results = process_in_parallel.expand(items=orders)
aggregate(results)

Sensors: wait for events

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.sensors.http_sensor import HttpSensor

# Wait for an S3 file to appear
wait_for_file = S3KeySensor(
    task_id="wait_for_data",
    bucket_name="my-bucket",
    bucket_key="data/{{ ds }}/input.csv",
    aws_conn_id="aws_default",
    poke_interval=60,
    timeout=60*60*3,          # 3 hours
    mode="reschedule",        # don't block a worker slot; check periodically
)

# Wait for an API endpoint
wait_for_api = HttpSensor(
    task_id="api_ready",
    http_conn_id="my_api",
    endpoint="/health",
    response_check=lambda r: r.status_code == 200,
)

Providers: the operator ecosystem

Airflow's killer feature is the provider packages. Pre-built operators for:

AWS — S3, EMR, SageMaker, Redshift, Glue, Athena, DynamoDB, ECS, Lambda, Step Functions, SQS, SNS
GCP — BigQuery, GCS, Dataflow, Composer, Vertex AI
Azure — Data Factory, Synapse, Cosmos DB
Databases — Postgres, MySQL, Snowflake, Redshift, Oracle, ClickHouse, Trino
Compute — Kubernetes (KubernetesPodOperator), Docker, Spark, Ray
Comms — Slack, email, PagerDuty
dbt, Spark, Hive, Presto, etc.

Install via pip install apache-airflow-providers-amazon, ...-snowflake, etc.

Connections + Variables

Connections (database URLs, AWS keys, API tokens) are stored encrypted in the Airflow metadata DB; per-Connection ID; reference from operators. Configure via UI (Admin → Connections) or env vars / Secrets backend.

# Env var convention
export AIRFLOW_CONN_WAREHOUSE='postgresql://user:pw@db:5432/warehouse'

# Then in DAG
PostgresOperator(task_id="...", postgres_conn_id="warehouse", sql="...")

For secrets backends: Vault (see that tutorial), AWS Secrets Manager, GCP Secret Manager. Airflow can fetch credentials per-task instead of storing in the metadata DB.

Scaling: executors

SequentialExecutor — dev only; one task at a time.
LocalExecutor — parallel within one machine via multi-processing. Good for small / mid Airflow.
CeleryExecutor — queue-based; workers can be on multiple hosts. Most common production choice.
KubernetesExecutor — one Kubernetes pod per task. Pay-per-use; no warm worker pool. Right for K8s-heavy environments.
CeleryKubernetesExecutor — hybrid; lightweight tasks via Celery, heavy tasks via K8s pods.

Datasets + scheduling on data updates

Airflow 2.4+ added Datasets: declarative "this task produces dataset X" + "this DAG runs when dataset Y updates." Crude version of asset-based orchestration (like Dagster):

from airflow.datasets import Dataset

my_dataset = Dataset("s3://my-bucket/processed/users.parquet")

with DAG("producer", schedule="@daily") as dag:
    extract = ... >> transform(outlets=[my_dataset])

# Downstream DAG triggers when my_dataset is updated
with DAG("consumer", schedule=[my_dataset]) as dag:
    consume = ...

Airflow 3.x

Airflow 3.x (released 2025) modernized the architecture: task SDK + execution API, multi-language tasks (not just Python), DAG versioning, better dev experience. Worth upgrading from 2.x if you're starting fresh.

Airflow vs Dagster vs Prefect

Airflow — task-graph mental model; largest ecosystem of operators; battle-tested at huge scale. Heavier operational footprint.
Dagster (see that tutorial) — asset-graph; lineage as first-class; modern Python API. Smaller ecosystem.
Prefect — flow-based Python; cleaner UX; smaller scope than Airflow.
Argo Workflows (see that tutorial) — Kubernetes-native; container-step-shaped; less Python-first.

When Airflow is the right pick

You're already on Airflow + the team knows it.
Massive ecosystem of operators matters (you need 20+ integrations off-the-shelf).
Workflows are clearly task-graph-shaped, not asset-graph-shaped.

When it isn't

For "I have a data warehouse + dashboards + want lineage," Dagster maps the work more honestly.
For lightweight Python flows, Prefect is less ceremony.
For "just run these jobs as K8s pods," Argo Workflows is simpler.
For "run this Python script daily," a systemd timer is one unit, not 5 Airflow components.