Practical Profiling with perf on Linux

Overview

What perf is

perf is the user-space front end to the Linux perf_event subsystem (the perf_event_open(2) syscall). It offers a uniform command-line interface to hardware Performance Monitoring Units (PMUs), kernel tracepoints, software counters, kprobes/uprobes, and eBPF events, hiding the architectural quirks of each CPU family. It ships in the kernel tree (tools/perf) and is packaged by most distributions as linux-tools-$(uname -r).1
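
On Debian/Ubuntu-style systems the install-and-sanity-check step looks roughly like this (package names vary by distribution; treat this as a sketch):

# Install the tool matching the running kernel (Debian/Ubuntu naming; other distros differ)
sudo apt install linux-tools-common linux-tools-$(uname -r)

perf --version       # confirm the front end is on PATH
perf list | less     # enumerate the events this kernel and PMU expose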

Why you might care

  • Pinpoint hot spots — attribute CPU cycles, stalled pipeline slots, cache misses or branch mispredictions to lines of code without recompiling.
  • Measure whole-system behaviour — sample across all tasks (-a) or within cgroups; profile kernels, containers and virtual machines (perf kvm).
  • Low overhead, production-safe — sampling typically incurs only 1–5 % overhead at common frequencies.
  • Rich ecosystem — outputs feed directly into FlameGraph, speedscope, or BPF-based visualizers.
  • Always up-to-date — because perf ships with each kernel release, new CPU features (hybrid/big.LITTLE core distinctions, Intel LBR call stacks, AMD IBS, Arm SPE, etc.) appear as soon as your distro updates its kernel.2

Mental model

  • Event: the thing you count or sample. Example flags: -e cycles, -e mem-loads,mem-stores
  • Counting vs. sampling: simple aggregate counters vs. periodic PC/IP snapshots. Example commands: perf stat, perf record -F 99
  • Call-graph mode: capture stack traces with frame-pointer unwinding, DWARF, Last-Branch-Records or BPF. Example flags: -g fp, -g dwarf, --call-graph lbr
  • perf.data: binary file holding the raw samples, post-processed by sub-commands such as perf report and perf script
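
To make the counting/sampling distinction concrete, here is a minimal pair of runs (sleep 1 stands in for any workload):

# Counting: one aggregate number per event, nothing written to disk
perf stat -e cycles,instructions -- sleep 1

# Sampling: take a snapshot every 100,000 cycles and store it in perf.data
perf record -e cycles -c 100000 -- sleep 1
perf script | head     # dump the raw samples that perf report aggregates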

Quick-start workflow

# 1. Get a high-level performance baseline
perf stat -e cycles,instructions,cache-misses ./my_app

# 2. Sample at 99 Hz, record call-graphs, system-wide (-a), while running a load generator
sudo perf record -F 99 -g -a -- ./my_loadgen.sh

# 3. Inspect the profile (TUI or stdio)
perf report            # interactive
perf report --stdio    # pipe to less

# 4. Zoom into specific functions/lines (annotate a hot symbol from perf.data)
perf annotate <symbol>      # e.g. the hottest function shown by perf report

perf top gives a live, htop-like view that continuously refreshes the hottest symbols.3 4
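
A quick example (the PID and the 2-second refresh interval are illustrative):

sudo perf top -d 2            # whole-system live view, refreshed every 2 s
sudo perf top -p 1234 -g      # live call-graph view of a single process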

Most-used sub-commands

  • perf stat: aggregate counters for one run; multiple events counted in parallel.
  • perf record / report / annotate: the sampling → analysis cycle; supports user stacks, kernel stacks and mixed mode.
  • perf top: real-time sampling (press a to annotate the symbol under the cursor).
  • perf trace: lightweight syscall/tracepoint tracer, an strace-like tool with far less overhead (example after this list).
  • perf sched: detect run-queue latency and involuntary context-switch delays.
  • perf mem: NUMA and memory-access profiling (perf mem record / perf mem report).
  • perf c2c: cache-to-cache false-sharing detector for multi-socket systems.
  • perf bench: built-in micro-benchmarks covering sched, syscall, mem (memcpy/memset), numa, futex and more.
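
A minimal sketch of a few of these in action (the PID and the 10-second window are illustrative):

sudo perf trace -p 1234 -s            # per-syscall summary for a running process
sudo perf sched record -- sleep 10    # capture scheduler events for 10 s
sudo perf sched latency               # report per-task run-queue latency from that capture
perf bench mem memset                 # built-in memset micro-benchmark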

Choosing and grouping events

# Sample multiple hardware events as a group so they are counted together
sudo perf record -e '{cycles,cache-misses,branch-misses}:u' -c 100000 -g ./app

# Use raw event codes if a named alias is missing on your CPU
# (raw encodings are micro-architecture specific; check your vendor's event tables
#  before relying on any particular rNNNN value)
sudo perf stat -e r003c,r0041 ./app

Individual events can be limited to user (:u), kernel (:k), or hypervisor (:h) privilege levels. Event selection varies by micro-architecture, but perf list prints everything supported on the running kernel.
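
Two quick examples, assuming the same ./app binary as above:

# Count kernel-only vs. user-only cycles side by side
perf stat -e cycles:u -e cycles:k ./app

# Discover what this machine exposes
perf list hw             # hardware events only
perf list 'sched:*'      # scheduler tracepoints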

Scope and filtering

  • Per-PID / thread: -p PID, -t TID
  • CPU mask: -C 0-3,6 (e.g. profile only the big cores)
  • Duration / delay: bound a run by appending -- sleep 10, or delay measurement start with --delay (milliseconds); perf stat also accepts --timeout (milliseconds)
  • cgroup: --cgroup=my_ctn, with the name given relative to the cgroup mount point (container-only profiling); examples follow this list
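
For instance (PID, CPU list and cgroup name are all illustrative):

# Sample only CPUs 0-3, system-wide, for about 10 seconds
sudo perf record -F 99 -g -a -C 0-3 -- sleep 10

# Attach to an already-running process for 30 seconds
sudo perf record -F 99 -g -p 1234 -- sleep 30

# Count events for one cgroup only (--cgroup requires system-wide/per-CPU mode)
sudo perf stat -e cycles -a --cgroup=my_ctn -- sleep 10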

Call-graph collection nuances

  • -g fp (frame pointer): zero config and low overhead, but needs a frame-pointer-enabled build; the default on many distros.
  • -g dwarf: works with frame-pointer-omitted builds at a higher unwind cost; faster on kernels ≥ 6.8 thanks to ORC.
  • --call-graph lbr (Intel/AMD): near-zero overhead and deep stacks, but requires hardware LBR support; hybrid-core aware on kernels ≥ 6.8.
  • --call-graph lbr,bpf: uses a BPF helper to unwind userspace; best for mixed languages; new in kernel 6.9.
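
A few concrete invocations, assuming the same ./app used earlier:

# Frame-pointer unwinding (cheap; the binary must be built with frame pointers)
sudo perf record --call-graph fp ./app

# DWARF unwinding for builds that omit frame pointers (copies up to 8 KiB of stack per sample)
sudo perf record --call-graph dwarf,8192 ./app

# Hardware LBR call stacks, on CPUs that support them
sudo perf record --call-graph lbr ./app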

Stacks from JIT runtimes (JVM, V8, .NET) need perf map support or BPF CO-RE unwinders.
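
For the JVM, one common route is the jitdump flow; a minimal sketch, where the agent path and app.jar are illustrative placeholders:

# Record with a monotonic clock and the JVMTI agent emitting jitdump data
perf record -k 1 -g -- java -agentpath:/path/to/libperf-jvmti.so -jar app.jar
perf inject --jit -i perf.data -o perf.data.jitted    # merge JIT symbols into the profile
perf report -i perf.data.jitted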

Post-processing and visualisation

# Convert to folded stacks for FlameGraph
perf script | stackcollapse-perf.pl > out.folded
flamegraph.pl out.folded > flame.svg

# Generate input for speedscope (it imports `perf script` output directly)
perf script -F +pid > profile.linux-perf.txt
# then load profile.linux-perf.txt at https://www.speedscope.app or with the speedscope CLI

perf inject can enrich a perf.data file after the fact (for example, perf inject --jit merges JIT-dump symbols, as sketched in the call-graph section above), and perf archive bundles the symbol files needed for offline analysis.
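
The offline hand-off looks roughly like this (paths are illustrative):

# On the machine that was profiled: bundle the build-id symbol files
perf archive perf.data        # writes perf.data.tar.bz2 alongside perf.data

# On the analysis machine: copy both files over, unpack the symbols, then report as usual
mkdir -p ~/.debug && tar xf perf.data.tar.bz2 -C ~/.debug
perf report -i perf.data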

Practical tips & caveats

  • Privileges: events marked "Precise" or raw PMU codes usually require root (or CAP_PERFMON), or a sufficiently low kernel.perf_event_paranoid setting (see the snippet after this list).
  • Match versions: keep the userspace perf close to the running kernel (ideally the matching linux-tools package); perf version prints the tool's version, uname -r the kernel's.
  • Build-ID cache: perf buildid-cache --add /path/to/lib.so ensures symbols for stripped binaries.
  • Minimise distortion: prefer period-based sampling (-c) for very short-lived functions; throttle the sampling frequency in production (e.g. -F 400).
  • Containerised workloads: under Kubernetes use --cgroup or --uid filters, and make sure /proc/sys/kernel/perf_event_paranoid on the host allows sampling.
  • Hybrid (P-/E-core) systems: restrict an event to one core type by qualifying it with its PMU, e.g. -e cpu_core/cycles/ or -e cpu_atom/cycles/, so counts from different core types are not mixed.
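
The permissions check referenced from the privileges bullet above (remember to revert the sysctl when you are done):

cat /proc/sys/kernel/perf_event_paranoid      # on most distros the default (2) limits unprivileged users to user-space-only profiling
sudo sysctl kernel.perf_event_paranoid=1      # temporarily relax it; revert after profiling
perf buildid-cache --add /path/to/lib.so      # pre-load symbols for a stripped library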

Conclusion

perf combines a profiler, tracer, and benchmark suite into a single first-party, always-available tool. By counting or sampling nearly every performance-relevant event the kernel exposes, it lets you move from “the code feels slow” to a quantified, line-level diagnosis in minutes—without invasive instrumentation or proprietary SDKs. Armed with the commands above you can start measuring before guessing and make data-driven optimisation a routine part of your Linux workflow.