Paperless-ngx: a self-hosted document archive with OCR

The stack

paperless-ngx (webserver): Django app with the UI and API.
paperless-ngx (consumer): a process from the same image that watches the consume folder and processes new files. In the compose file below it runs inside the webserver container, not as a separate service.
redis: message broker for the background task queue.
postgres: metadata (titles, tags, correspondents, dates).
gotenberg + tika (optional): convert Office documents and emails to PDF before ingest.

docker compose

# docker-compose.yml (adapted from the upstream template)
services:
  broker:
    image: docker.io/library/redis:7
    restart: unless-stopped
    volumes:
      - redisdata:/data

  db:
    image: docker.io/library/postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on: [db, broker, gotenberg, tika]
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: ${DB_PASSWORD}
      PAPERLESS_OCR_LANGUAGE: eng+fra
      PAPERLESS_TIME_ZONE: America/Toronto
      PAPERLESS_URL: https://docs.example.com
      PAPERLESS_SECRET_KEY: ${SECRET_KEY}
      PAPERLESS_TIKA_ENABLED: 1
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998

  gotenberg:
    image: docker.io/gotenberg/gotenberg:8
    restart: unless-stopped
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"

  tika:
    image: docker.io/apache/tika:latest
    restart: unless-stopped

volumes:
  data:
  media:
  pgdata:
  redisdata:

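Generate .env once from a shell. Compose reads .env values literally and does not run $(...) inside the file, so the secrets have to be expanded before they are written:

printf 'DB_PASSWORD=%s\n' "$(openssl rand -base64 36 | tr -d '\n')" > .env
printf 'SECRET_KEY=%s\n' "$(openssl rand -base64 60 | tr -d '\n')" >> .env
chmod 600 .env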

Bring it up:

docker compose up -d
docker compose logs -f webserver

First start runs database migrations; then create the admin user:

docker compose exec webserver \
    python manage.py createsuperuser

The consume folder

Anything dropped into ./consume/ on the host (mounted at /usr/src/paperless/consume in the container) is picked up automatically: PDFs, JPGs, PNGs, TIFFs, and (with Tika enabled) .docx, .odt, .eml, etc. The consumer:

OCRs the document with Tesseract if needed
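Detects the date, then assigns correspondent, tags, and document type using the existing matching rules
Stores the original under media/documents/originals/, named by document ID by default (or by PAPERLESS_FILENAME_FORMAT if set)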
Generates a thumbnail
Indexes the OCR'd text into the search engine

The original file is moved out of ./consume/ on success.
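
For scripts that feed the consume folder, one cautious pattern is to copy the file somewhere else first and then rename it into place, so a slow copy never leaves a partial file sitting in ./consume/. The sketch below assumes the staging directory lives on the same filesystem as the consume folder; drop_into_consume and the staging directory are hypothetical names for illustration, not part of Paperless.

```python
import os
import shutil
from pathlib import Path


def drop_into_consume(source: Path, consume_dir: Path, staging_dir: Path) -> Path:
    """Copy `source` to a staging directory, then rename it into the consume folder.

    The rename is a single step only when staging_dir and consume_dir are on
    the same filesystem; across filesystems os.replace raises OSError.
    """
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / source.name
    shutil.copy2(source, staged)        # the slow copy happens outside consume/
    target = consume_dir / source.name
    os.replace(staged, target)          # one rename puts the finished file in consume/
    return target
```

Called as drop_into_consume(Path("scan.pdf"), Path("./consume"), Path("./staging")), it leaves the source file untouched and returns the path inside the consume folder.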

Reverse proxy

# Caddy
docs.example.com {
    reverse_proxy 127.0.0.1:8000
    request_body {
        max_size 500MB
    }
}

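PAPERLESS_URL must match the public address (https://docs.example.com, as set in the compose file above) so that Paperless generates correct redirect URLs and accepts requests arriving through the proxy.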

How to actually use it

Three patterns:

Drag-and-drop in the web UI. Simplest; works for occasional uploads.
Mobile scan apps — Paperless Mobile for Android, several iOS clients via the REST API. Snap a photo, auto-crop, auto-deskew, upload — same workflow as Dropbox's Scan feature, but landing in your own server.
Scanner with network destination. Document scanners (Brother ADS-2700W, Fujitsu ScanSnap, Epson WorkForce) can scan-to-SMB or scan-to-email. Point them at the consume folder share or an email address Paperless polls.
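
The mobile clients above talk to the same REST API you can script yourself. As a rough sketch, the function below builds (but does not send) an upload request for the /api/documents/post_document/ endpoint using only the standard library. It assumes token authentication with a token created for your user in the web UI; build_upload_request is a hypothetical helper name, and the multipart encoding is deliberately minimal.

```python
import uuid
import urllib.request
from typing import Optional


def build_upload_request(
    base_url: str,
    token: str,
    filename: str,
    content: bytes,
    title: Optional[str] = None,
) -> urllib.request.Request:
    """Build a multipart POST that uploads one file to Paperless."""
    boundary = uuid.uuid4().hex
    parts = []
    if title is not None:
        # Optional form field: the title to give the new document.
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="title"\r\n\r\n'
            f'{title}\r\n'.encode("utf-8")
        )
    # Required form field: the file itself (filename assumed to contain no quotes).
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="document"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + content
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/documents/post_document/",
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the result to urllib.request.urlopen() sends it; the upload is queued and then goes through the same consumer steps as a file dropped into the consume folder.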

Tags, correspondents, and document types

These are Paperless's three classification dimensions:

Tag — a multi-valued label (taxes, healthcare, warranty, important).
Correspondent — the issuer / sender of the document (the city, the bank, the landlord).
Document type — a single-valued classification (Invoice, Receipt, Statement, Letter, Contract).

For each, you can set matching rules: a tag with "match any of: invoice, facture" is applied automatically to every document whose OCR'd text contains either word. After a few hundred manually classified documents, Paperless's optional ML classifier can learn from those classifications and start assigning tags by itself; it applies only to tags, correspondents, and document types whose matching algorithm is set to "Auto".
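
To make the "match any of" idea concrete, here is a simplified illustration of that kind of rule: case-insensitive, whole-word matching against the OCR'd text. It is a sketch of the concept, not Paperless's own matching code, and matches_any is a hypothetical name.

```python
import re


def matches_any(match_words: str, text: str) -> bool:
    """Return True if any of the words appears in `text` as a whole word."""
    words = match_words.replace(",", " ").split()
    return any(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
        for word in words
    )
```

With the rule "invoice, facture", a scan containing "FACTURE no 42" matches, while a bank statement that mentions neither word does not. Whole-word matching also means "invoices" does not trigger a rule for "invoice".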

Email ingest

Settings → Mail → Add IMAP account. Paperless polls the inbox, downloads attachments matching the rule (size limits, sender filters, subject patterns), ingests them, and optionally marks the source email as read or moves it to a different folder. Useful for "Bills" folders that auto-route into the archive.

Backups

Three things to capture:

data/ — index, configuration
media/ — the original document files (the actual data, in their original format)
The Postgres database — metadata, tags, correspondents

For exportable, restorable backups, Paperless ships a one-command "document exporter":

docker compose exec webserver \
    document_exporter ../export --no-progress-bar --compare-checksums

This dumps the entire archive plus its JSON metadata into ./export/ in a format Paperless can re-import later (document_importer). Combined with a restic job (see restic + S3) on that directory, you get versioned, offsite, encrypted backups of the whole archive.
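
If the export runs unattended before the restic job, a small wrapper can run it and sanity-check the result. The sketch below assumes the export directory contains a manifest.json holding a JSON list of serialized records, each with a "model" field, and that documents appear as "documents.document"; check this against an export from your own version. The -T flag tells docker compose exec not to allocate a TTY, which is what cron needs. run_export and count_exported_documents are hypothetical names.

```python
import json
import subprocess
from pathlib import Path

# Same command as above, with -T added for non-interactive use.
EXPORT_COMMAND = [
    "docker", "compose", "exec", "-T", "webserver",
    "document_exporter", "../export",
    "--no-progress-bar", "--compare-checksums",
]


def run_export(compose_dir: Path) -> None:
    """Run the exporter from the directory that holds docker-compose.yml."""
    subprocess.run(EXPORT_COMMAND, check=True, cwd=compose_dir)


def count_exported_documents(export_dir: Path) -> int:
    """Count document records in manifest.json (assumed layout, see above)."""
    manifest = json.loads((export_dir / "manifest.json").read_text(encoding="utf-8"))
    return sum(1 for entry in manifest if entry.get("model") == "documents.document")
```

A backup script could call run_export(), compare count_exported_documents() with the document count shown in the UI, and start restic only if the numbers agree.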

Storage growth

OCR adds a separate searchable PDF/A version of each document next to the original. Rule of thumb: 1.2–1.5× storage of the originals. For a household archive of multiple years, plan on a few GB; for a small business, tens of GB is normal.
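
As a worked example of that rule of thumb, the sketch below reads the 1.2 to 1.5× factor as total storage relative to the size of the originals, which is one interpretation. The document count and average file size are made-up inputs; substitute your own.

```python
def estimate_total_gb(doc_count: int, avg_original_mb: float, factor: float) -> float:
    """Estimate total archive size in GB from the originals and a growth factor."""
    originals_mb = doc_count * avg_original_mb
    return originals_mb * factor / 1000


# Hypothetical household archive: 5,000 documents averaging 0.5 MB each.
low = estimate_total_gb(5000, 0.5, 1.2)
high = estimate_total_gb(5000, 0.5, 1.5)
print(f"Plan for roughly {low:.1f} to {high:.1f} GB")
```

Leave extra room on top of this for the export directory and the database if they share the same disk.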

What it isn't

Paperless-ngx is an archive, not a collaboration tool. No real-time co-editing, no versioned edits, no shared workspaces (beyond per-document permissions in 2.x). For "team workflows on living documents," use Outline, BookStack, or Nextcloud. For "save the PDF the bank sent and find it again in three years," Paperless is exactly the right tool.