Kernel-level observability with eBPF and bpftrace

Install

# Debian / Ubuntu
sudo apt install bpftrace bpfcc-tools linux-headers-$(uname -r)

# Arch
sudo pacman -S bpftrace bcc

# Fedora
sudo dnf install bpftrace bcc-tools

A 5.x or newer kernel built with BTF (BPF Type Format) gives the smoothest experience; most current distro kernels ship that. Verify:

ls /sys/kernel/btf/vmlinux
sudo bpftrace --info | head

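If that file is missing, kprobes and tracepoints still work, but bpftrace has no built-in kernel type information: reading struct fields then needs kernel headers (hence linux-headers in the Debian line) and explicit #include lines in the script. The tracepoint-based one-liners below are unaffected, because tracepoint arguments are described by the kernel's tracefs format files rather than by BTF.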
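
The BTF check can be wrapped in a small shell helper for setup scripts. This is a minimal sketch; the function name have_btf and its optional path argument are made up for illustration and are not part of bpftrace.

```shell
# Hypothetical helper: report whether the running kernel exposes BTF.
# The optional first argument overrides the path (handy for testing).
have_btf() {
    btf_path="${1:-/sys/kernel/btf/vmlinux}"
    if [ -r "$btf_path" ]; then
        echo "BTF available: $btf_path"
    else
        echo "BTF missing: struct access will need kernel headers"
        return 1
    fi
}
```

Typical use would be something like: have_btf || echo "install linux-headers-$(uname -r)".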

One-liners

Most of bpftrace's day-to-day value is one-liners against existing probes.

# Count every syscall by name, system-wide (attaching to all syscall tracepoints takes a moment)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_*
    { @[probe] = count(); }'

# Histogram of read() sizes from any process
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read
    { @sizes = hist(args->count); }'

# Print every command executed system-wide (replacement for forkstat / execsnoop)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve
    { printf("%s %s\n", comm, str(args->filename)); }'

# Latency histogram (in ns) of every openat() call, the syscall behind open() in current libc
sudo bpftrace -e '
    tracepoint:syscalls:sys_enter_openat { @t[tid] = nsecs; }
    tracepoint:syscalls:sys_exit_openat /@t[tid]/ {
        @lat = hist(nsecs - @t[tid]);
        delete(@t[tid]);
    }
'

# Track who is opening /etc/shadow
sudo bpftrace -e '
    tracepoint:syscalls:sys_enter_openat
    /str(args->filename) == "/etc/shadow"/
    { printf("pid=%d comm=%s\n", pid, comm); }
'

Ctrl-C to stop. bpftrace prints any aggregations (@name maps) it accumulated.
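
For unattended runs Ctrl-C is inconvenient. One option is an interval probe that calls exit() after a fixed time, at which point bpftrace prints its maps as usual. A minimal sketch follows; the helper name syscall_count_prog is hypothetical, and it only builds the program text so that the bpftrace invocation stays visible at the call site.

```shell
# Hypothetical helper: print a bpftrace program that counts syscalls per
# process name and exits by itself after N seconds (default 10).
syscall_count_prog() {
    secs="${1:-10}"
    case "$secs" in
        ''|*[!0-9]*) echo "usage: syscall_count_prog SECONDS" >&2; return 1 ;;
    esac
    printf 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); } interval:s:%s { exit(); }\n' "$secs"
}
```

Run it (as root) with: sudo bpftrace -e "$(syscall_count_prog 5)"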

Built-in tools

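The bcc and bpftrace projects each ship dozens of curated diagnostic tools. The bcc versions are Python programs with embedded C; the bpftrace versions are plain .bt scripts covering many of the same names:

# bcc tools, Debian/Ubuntu names (other distros drop the -bpfcc suffix and install under /usr/share/bcc/tools/).
# bpftrace versions (biolatency.bt, ...) are under /usr/share/bpftrace/tools/ (or /usr/sbin/ on some distros).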
biolatency-bpfcc        # block I/O latency histogram
biosnoop-bpfcc          # every block I/O with PID, disk, sector, size, latency
execsnoop-bpfcc         # every process exec
opensnoop-bpfcc         # every open() with path + result
tcptop-bpfcc            # top-style TCP throughput per connection
tcpconnlat-bpfcc        # latency of each active TCP connect
runqlat-bpfcc           # CPU run queue latency
slabratetop-bpfcc       # kernel slab allocations by type
profile-bpfcc           # sampling profiler with stack traces
funccount-bpfcc         # count calls to a kernel function

Each is a complete diagnostic in itself; reading the source of the bpftrace (.bt) version of one is among the easiest ways to learn bpftrace by example.

Custom scripts

A bpftrace script lives in a .bt file. Example: file-open latency by command, only for opens that take more than 1 ms:

#!/usr/bin/env bpftrace
// open-slow.bt

tracepoint:syscalls:sys_enter_openat {
    @t[tid] = nsecs;
    @f[tid] = str(args->filename);
}

tracepoint:syscalls:sys_exit_openat
/@t[tid]/
{
    $lat = nsecs - @t[tid];
    if ($lat > 1000000) {      // 1 ms in ns
        printf("%-16s %6dus %s\n", comm, $lat / 1000, @f[tid]);
    }
    delete(@t[tid]);
    delete(@f[tid]);
}

END {
    clear(@t);
    clear(@f);
}

sudo bpftrace open-slow.bt
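
The 1 ms threshold does not have to be hard-coded: bpftrace exposes command-line arguments to the script as positional parameters ($1, $2, ...). The sketch below writes a variant of the script that takes the threshold in nanoseconds as $1. The helper name write_open_slow and the default file name open-slow-arg.bt are made up for illustration.

```shell
# Hypothetical helper: write a variant of open-slow.bt whose threshold
# (in nanoseconds) is read from bpftrace's first positional parameter.
write_open_slow() {
    out="${1:-open-slow-arg.bt}"
    # Quoted 'EOF' keeps the shell from expanding $1 and $lat.
    cat > "$out" <<'EOF'
#!/usr/bin/env bpftrace
tracepoint:syscalls:sys_enter_openat { @t[tid] = nsecs; @f[tid] = str(args->filename); }
tracepoint:syscalls:sys_exit_openat /@t[tid]/ {
    $lat = nsecs - @t[tid];
    if ($lat > $1) { printf("%-16s %6dus %s\n", comm, $lat / 1000, @f[tid]); }
    delete(@t[tid]); delete(@f[tid]);
}
END { clear(@t); clear(@f); }
EOF
}
```

After write_open_slow, a 5 ms threshold would be: sudo bpftrace open-slow-arg.bt 5000000. If the argument is omitted, $1 evaluates to 0 and every open is printed.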

Probe types worth knowing

tracepoint:<subsystem>:<name> - stable, kernel-defined trace points. Best choice when one exists for what you want; they rarely change between kernel versions.
kprobe:<function> - entry of almost any kernel function (inlined and blacklisted functions are excluded). Unstable across kernel versions but vastly more general.
kretprobe:<function> - return of a kernel function. Useful for measuring latency or capturing return values.
uprobe:<binary-or-library>:<function> - entry of a user-space function. The path must be the file that actually contains the symbol: for example, uprobe:/usr/lib/x86_64-linux-gnu/libssl.so.3:SSL_read (path varies by distro) traces TLS reads in every process that loads that libssl, whereas /usr/bin/openssl normally links libssl dynamically and does not contain SSL_read.
usdt:<binary>:<probe-name> - statically defined trace points the binary opted into (many databases, libc, and JVMs have these).
profile:hz:<freq> - sample at N Hz on all CPUs, e.g. profile:hz:99. The basis for flame-graph profiling.
interval:<unit>:<n> - fire on one CPU every n units, e.g. interval:s:1 or interval:ms:100; useful for periodic snapshots of maps.

To discover what's available:

sudo bpftrace -l 'tracepoint:syscalls:*read*'
sudo bpftrace -l 'kprobe:tcp_*'
sudo bpftrace -l 'uprobe:/usr/lib/x86_64-linux-gnu/libc.so.6:*malloc*'
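
Once a function shows up in the kprobe listing, a kprobe/kretprobe pair can time it, using the same start-timestamp pattern as the openat example. A minimal sketch, assuming the function is probeable on your kernel; the helper name func_latency_prog is hypothetical and it only generates the program text.

```shell
# Hypothetical helper: print a bpftrace program that histograms the
# latency (ns) of one kernel function via a kprobe/kretprobe pair.
func_latency_prog() {
    fn="$1"
    case "$fn" in
        ''|*[!A-Za-z0-9_]*) echo "usage: func_latency_prog KERNEL_FUNCTION" >&2; return 1 ;;
    esac
    printf 'kprobe:%s { @start[tid] = nsecs; } kretprobe:%s /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }\n' "$fn" "$fn"
}
```

For example: sudo bpftrace -e "$(func_latency_prog vfs_read)"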

Aggregations and histograms

Aggregations are what keep bpftrace cheap compared with printing every event: the counting and bucketing happen in the kernel, and only the summarized map crosses to user space.

// Linear histogram of a latency held in $elapsed_us: range 0 to 10000 us, bucket width 100 us
@lat = lhist($elapsed_us, 0, 10000, 100);

// Exponential (default) histogram
@bytes = hist(args->count);

// Count, average, and total in one call (min() and max() are separate functions)
@reads = stats(args->count);

// Per-key
@by_comm[comm] = count();      // count events grouped by process name
@by_pid[pid] = sum(args->count); // sum requested bytes by PID
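
Put together, these pieces give a small top-style view: a per-key sum that an interval probe prints and resets every second. The sketch below is an illustration rather than a shipped tool; the helper name read_top_prog is hypothetical. It sums the bytes actually returned by read() (args->ret on the exit tracepoint) and prints the five largest entries.

```shell
# Hypothetical helper: print a bpftrace program that shows, once per
# second, the five process names that read the most bytes.
read_top_prog() {
    cat <<'EOF'
tracepoint:syscalls:sys_exit_read /args->ret > 0/ {
    @bytes[comm] = sum(args->ret);
}
interval:s:1 {
    time("%H:%M:%S\n");
    print(@bytes, 5);
    clear(@bytes);
}
EOF
}
```

Run with: sudo bpftrace -e "$(read_top_prog)"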

Real example: tracing slow disk I/O

Question: which process is making disk I/O slow right now? Plain iostat shows device-level latency; BPF tools can attribute each I/O to a PID. The quickest route is bcc's biosnoop:

# Debian / Ubuntu: sudo biosnoop-bpfcc
sudo /usr/share/bcc/tools/biosnoop

Output: one line per block I/O, with PID, comm, device, R/W, sector, size, latency. Sort by latency, find the offenders.
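
The sorting step can be done with standard text tools. A minimal sketch, assuming the output has a single header line and the latency in the last column; the helper name slowest_io is made up for illustration.

```shell
# Hypothetical helper: keep the N slowest I/Os (default 10) from
# biosnoop-style output on stdin. Assumes one header line and the
# latency as the last whitespace-separated column.
slowest_io() {
    awk 'NR > 1 { print $NF, $0 }' |   # prefix each line with its latency
        sort -k1,1nr |                 # numeric sort, largest first
        head -n "${1:-10}" |
        cut -d' ' -f2-                 # drop the prefix again
}
```

One way to use it: redirect biosnoop to a file, stop it with Ctrl-C after a while, then run slowest_io 5 < bio.txt (bio.txt being whatever file name you chose).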

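Or build your own. A cut-down version in bpftrace:

#!/usr/bin/env bpftrace
// biosnoop-mini.bt

// Save the issue time and task, keyed by device and sector. The task is
// usually the submitter, but can be a kernel worker (e.g. buffered writes).
tracepoint:block:block_rq_issue
{
    @start[args->dev, args->sector] = nsecs;
    @iocomm[args->dev, args->sector] = comm;
    @iopid[args->dev, args->sector] = pid;
}

// Completion runs in interrupt context, so comm/pid here would be unrelated;
// print the values saved at issue time instead.
tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
    $lat_us = (nsecs - @start[args->dev, args->sector]) / 1000;
    printf("%-16s %-8d %-8s %6d us\n", @iocomm[args->dev, args->sector],
        @iopid[args->dev, args->sector], args->rwbs, $lat_us);
    delete(@start[args->dev, args->sector]);
    delete(@iocomm[args->dev, args->sector]);
    delete(@iopid[args->dev, args->sector]);
}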

Overhead

eBPF programs run in the kernel, attached to events that are already happening. A probe on a high-frequency path (e.g. every read()) adds a small fixed cost per event, roughly tens to a few hundred nanoseconds depending on the probe type. On most real workloads that is well under 1%, but it becomes visible when instrumenting truly hot paths. The reasonable rule: instrument first, and optimize the instrumentation only if it visibly perturbs the workload.
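
Whether a probe "visibly perturbs" a workload is easy to quantify: time the same workload with and without the probe attached and compare. A minimal sketch of the arithmetic; the helper name overhead_pct is hypothetical, and the two arguments are wall-clock times you measured yourself (for example with time).

```shell
# Hypothetical helper: relative overhead in percent, given the baseline
# runtime and the runtime with the probe attached (same unit for both).
overhead_pct() {
    awk -v base="$1" -v probed="$2" 'BEGIN {
        if (base + 0 <= 0) exit 1          # baseline must be positive
        printf "%.1f\n", (probed - base) / base * 100
    }'
}
```

For instance, a run that takes 2.00 s unprobed and 2.10 s with the probe attached gives overhead_pct 2.00 2.10, which prints 5.0.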

What's not bpftrace's lane

Long-running production telemetry - for that, the structured-output BPF-based projects (parca-agent, pixie, cilium/tetragon, biotop running as a service) are the right tool.
Modifying behavior - bpftrace is essentially read-and-aggregate only. Full eBPF programs can rewrite or redirect packets, enforce policy, and so on, but that is a C-with-libbpf workflow.
Kernels older than 4.18 - usable, but BTF and many tracepoints will be missing.