DuckDB for analytics on local files

Install

Three common paths:

# Linux (static binary, no system package needed)
curl https://install.duckdb.org | sh
# binary lands in ~/.duckdb/cli/latest/duckdb — add to PATH

# macOS
brew install duckdb

# Python (most common for embedding in scripts)
pip install duckdb

The CLI starts an in-memory database by default:

$ duckdb
v1.x  release-tagged
D SELECT 'hello';
hello

Querying files directly

The single most useful DuckDB feature: SQL over files, no CREATE TABLE, no import step. Every file format has a table function (read_csv, read_parquet, read_json), and a path can also be used directly in FROM when the format is unambiguous.

-- CSV
SELECT region, sum(amount)
FROM 'sales-2026-q1.csv'
GROUP BY region
ORDER BY 2 DESC;

-- Parquet (column subset only the projected columns are read)
SELECT user_id, ts
FROM 'events/*.parquet'
WHERE event = 'signup'
  AND ts >= '2026-01-01';

-- JSON (newline-delimited)
SELECT data->>'$.user.email' AS email, count(*)
FROM read_json_auto('logs/*.json')
WHERE data->>'$.path' = '/checkout'
GROUP BY 1;

Globs are expanded by DuckDB itself, so events/*.parquet works without a shell. Predicates push down into the Parquet reader — for partitioned data, only the relevant row groups are read off disk.

Joining across file formats

Heterogeneous joins are the demo that sells DuckDB — one query, three different file formats:

SELECT
    u.id,
    u.email,
    count(e.event_id) AS events
FROM 'users.parquet'   AS u
LEFT JOIN read_json_auto('events.ndjson') AS e ON e.user_id = u.id
LEFT JOIN 'features.csv'                  AS f ON f.user_id = u.id
WHERE f.cohort = 'beta'
GROUP BY u.id, u.email
ORDER BY events DESC
LIMIT 50;

No ETL stage, no import script. The optimizer sees the projections and pushes them down per source.

Persisting to a .duckdb file

Start with a file argument to use a persistent database instead of in-memory:

$ duckdb analytics.duckdb
D CREATE TABLE events AS
    SELECT * FROM 'events/*.parquet';
D CREATE INDEX events_user ON events(user_id);
D SELECT count(*) FROM events;

A persistent .duckdb file is useful when the same dataset is queried many times — the file is column-oriented, compressed, and indexed; subsequent queries are faster than re-reading Parquet for ad-hoc work.

Cross-database joins work too. Attach a SQLite or PostgreSQL database and query both engines at once:

INSTALL sqlite;  LOAD sqlite;
ATTACH 'app.sqlite' AS app (TYPE sqlite);

SELECT
    u.id, u.email,
    count(o.id) AS orders
FROM app.users u
LEFT JOIN 'orders.parquet' o ON o.user_id = u.id
GROUP BY 1, 2;

Reading from S3 / HTTP

The httpfs extension turns S3 (and any HTTPS URL) into a virtual filesystem:

INSTALL httpfs;  LOAD httpfs;

-- HTTPS
SELECT count(*)
FROM 'https://datasets.example.com/2026-04.parquet';

-- S3 with a credential chain (env vars, EC2 IMDS, etc.)
CREATE SECRET (TYPE S3, PROVIDER credential_chain);

SELECT region, sum(amount)
FROM 's3://my-bucket/sales/year=2026/month=*/*.parquet'
GROUP BY 1;

Predicate push-down still applies — for partitioned Hive-style paths, DuckDB skips entire prefixes that don't match the WHERE. For columnar Parquet on object storage, it reads only the row groups and columns the query needs, which is most of the operational difference between "fast" and "unusably slow" at this scale.

Python integration

import duckdb
import pandas as pd

# Query a Parquet file directly into a DataFrame
df = duckdb.sql("""
    SELECT region, sum(amount) AS total
    FROM 'sales-2026-q1.parquet'
    GROUP BY 1
    ORDER BY 2 DESC
""").df()

# Query a DataFrame as if it were a SQL table
sales = pd.read_parquet("sales-2026-q1.parquet")
duckdb.sql("SELECT region, sum(amount) FROM sales GROUP BY 1").df()

The DataFrame variable is registered with DuckDB by name automatically. The same trick works with Polars DataFrames and PyArrow tables.

Writing outputs

-- Write query results to disk
COPY (SELECT * FROM events WHERE ts >= '2026-01-01') TO 'recent.parquet';
COPY (SELECT * FROM events) TO 'out.csv' (HEADER);

-- Or in one step
CREATE TABLE recent AS
    SELECT * FROM 'events/*.parquet' WHERE ts >= '2026-01-01';
COPY recent TO 'recent.parquet';

What it isn't for

DuckDB is single-node, single-writer. It's not a replacement for a transactional database (use Postgres for that) or a distributed warehouse (use Snowflake/BigQuery/Trino for the multi-TB-fan-out case). The sweet spot is the "I have a directory of Parquet files and a question" workflow that previously required spinning up Spark or writing a hundred lines of pandas — DuckDB does it in five lines of SQL and finishes before spark-submit would have started.