perf & bpftrace

Use perf for low-overhead sampling and bpftrace for event-driven answers. This page covers the commands, prerequisites, and the mistakes that waste the most time.

If you only remember six things
  • Start with the symptom: CPU saturation, scheduler latency, syscalls, I/O latency, or retransmits. The tool choice follows from the question.
  • Prefer sampling first. perf stat, perf top, and a 30-second perf record are usually enough to tell you where to dig.
  • Trace narrowly in production: one PID, one cgroup, one tracepoint, and a short capture window.
  • Readable stacks depend on prerequisites: symbols, frame pointers, debuginfo, and, for eBPF especially, BTF.
  • Kernel frames like __schedule or futex_wait often mean waiting, not burning CPU. Don't optimize the scheduler when the app is blocked.
  • Always save the exact command, kernel version, and time window so you can compare before and after a change.

Sampling vs tracing

Both tools answer performance questions, but they work differently. perf samples what the CPU was doing at intervals or counts hardware/software events. bpftrace runs tiny eBPF programs when a chosen event fires. The fastest path is usually sample first, trace second.

Question | Reach for | Why
What is burning CPU right now? | perf top | Live sampled view of hot symbols with low overhead.
Did this deploy increase instructions, cache misses, or context switches? | perf stat | Great for before/after baselines and quick sanity checks.
Which stacks consumed CPU over the last 30 seconds? | perf record + perf report | Creates a profile you can inspect later or turn into a flame graph.
Which process is flooding syscalls or hitting one kernel path? | bpftrace | Event-driven counters and histograms answer exact "who triggered this?" questions.
What is the distribution of latency? | bpftrace | Histograms are one of eBPF's strongest use cases.

Related: Start with lsof & strace when you need exact syscall arguments, and with Observability Overview when you need to decide whether the problem is better seen as metrics, logs, or traces.

Prereqs, permissions, and BTF

Tool output quality is mostly about environment quality. If the kernel forbids performance events, symbols are stripped, or BTF is missing, you get permission errors and [unknown] frames instead of answers.

# RHEL 9 / CentOS Stream
sudo dnf install -y perf bpftrace bpftool kernel-devel

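# Debian
sudo apt update && sudo apt install -y linux-perf bpftrace bpftool linux-headers-$(uname -r)

# Ubuntu (perf and bpftool ship in the linux-tools packages)
sudo apt update && sudo apt install -y linux-tools-common linux-tools-$(uname -r) bpftrace linux-headers-$(uname -r)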

# See whether the running kernel exposes BTF
ls -l /sys/kernel/btf/vmlinux
sudo bpftool feature probe kernel | less

# Check perf restrictions
sysctl kernel.perf_event_paranoid
sysctl kernel.kptr_restrict

What you usually need:

BTF in one sentence: it is compact kernel type metadata. With BTF present, eBPF tools can resolve structs and tracepoints without guessing at kernel internals. On RHEL-family systems this usually "just works" on supported kernels; on custom kernels it is often the missing piece.
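
The BTF and perf restriction checks above can be folded into a small helper. This is a minimal sketch, not a distro-supplied tool: the function names has_btf and describe_paranoid are made up for illustration, and the level descriptions paraphrase the mainline kernel's sysctl documentation. Some distro kernels (Debian and Ubuntu, for example) patch in stricter levels above 2, which the sketch lumps together.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of perf or bpftool.

# Succeeds if the kernel exposes BTF at the given path
# (default: the usual sysfs location).
has_btf() {
  [ -r "${1:-/sys/kernel/btf/vmlinux}" ]
}

# Map a kernel.perf_event_paranoid value to the strictest restriction it
# places on unprivileged users. Restrictions are cumulative: each level
# also includes the ones below it.
describe_paranoid() {
  level=$1
  if [ "$level" -le -1 ]; then
    echo "unrestricted"
  elif [ "$level" -eq 0 ]; then
    echo "no raw tracepoint access"
  elif [ "$level" -eq 1 ]; then
    echo "no system-wide CPU events"
  elif [ "$level" -eq 2 ]; then
    echo "no kernel profiling"
  else
    echo "distro-specific: perf likely blocked for unprivileged users"
  fi
}

# Example: describe_paranoid "$(sysctl -n kernel.perf_event_paranoid)"
```

Running everything under sudo sidesteps the paranoid level entirely, so the helper mostly matters when you want developers to profile their own processes without root.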

perf stat

perf stat is the fastest baseline you can take. It does not tell you where time went, but it tells you whether the workload is instruction-heavy, memory-stalled, branch-mispredict-heavy, or switching all over the place.

# Whole-system baseline for 30 seconds (-a counts on all CPUs; without it only the sleep command is measured)
sudo perf stat -a -d -d -d -- sleep 30

# Attach to one PID and print counters every second
sudo perf stat -p 4242 -I 1000

# Focus on the counters most people actually compare
sudo perf stat -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations,page-faults \
  -p 4242 -- sleep 15

Example output (abridged):

 Performance counter stats for process id '4242':

      36,428,100,553      cycles
      19,004,242,887      instructions              #    0.52  insn per cycle
          68,117,004      cache-misses              #    7.84% of all cache refs
          14,921,447      context-switches
             132,884      cpu-migrations
           1,942,201      page-faults

      15.001777081 seconds time elapsed
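
The derived numbers are simple ratios, and it helps to recompute them when comparing two captures of different length. A minimal sketch using the sample output above; the helper names ipc and per_sec are hypothetical, not perf features.

```shell
# Hypothetical helpers for reading perf stat counters.

# Instructions per cycle, rounded to two decimals.
# args: cycles, instructions
ipc() {
  awk -v cycles="$1" -v insns="$2" 'BEGIN { printf "%.2f\n", insns / cycles }'
}

# Events per second over the measured interval, rounded to an integer.
# args: event count, elapsed seconds
per_sec() {
  awk -v count="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", count / secs }'
}

ipc 36428100553 19004242887     # cycles and instructions from the sample: 0.52
per_sec 14921447 15.001777081   # context switches per second
```

For the sample above this works out to roughly 995,000 context switches per second. Together with an IPC of 0.52, that points at a process that spends much of its time switching or stalled rather than retiring instructions.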

perf top

perf top is the live view: "what is hot right now?" It is the fastest way to spot a bad regex, a compression routine melting CPU, or a lock-heavy kernel path.

# Whole-system hot symbols
sudo perf top

# One process, 99 Hz sample rate, with call graphs
sudo perf top -F 99 -p 4242 -g --call-graph fp

# Sort by command, shared object, and symbol
sudo perf top -p 4242 --sort comm,dso,symbol

If the hottest entries are user symbols in your app, you probably have a real CPU hot path. If the hottest entries are scheduler, futex, or idle transitions, the process may be waiting and merely getting sampled there when it wakes.

perf record and perf report

perf record captures a profile for later inspection. This is the default answer when perf top showed something interesting and you need a stable artifact, not a live screen.

# 30-second CPU profile of one process
sudo perf record -F 99 -g --call-graph fp -p 4242 -- sleep 30

# Inspect the profile in the terminal
sudo perf report --stdio --sort comm,dso,symbol

# Annotate one hot symbol down to the instruction level
sudo perf annotate --stdio -s sha256_block_data_order

For runtimes that do not preserve frame pointers, switch to DWARF call graphs and accept the extra cost:

sudo perf record -F 99 -g --call-graph dwarf,16384 -p 4242 -- sleep 30

Sampling frequency matters. -F 999 feels precise but is usually pointless and can become expensive on busy systems. Start at 49 or 99 Hz unless you already know you need finer resolution.
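
To see what a frequency choice costs, estimate the sample budget before recording. This is a rough sketch under a simplifying assumption: every sampled CPU is fully busy for the whole window, so the result is an upper bound. The helper name expected_samples is hypothetical.

```shell
# Hypothetical helper: upper bound on the number of samples in a capture.
# args: sampling rate in Hz, capture length in seconds, busy CPUs sampled
expected_samples() {
  echo $(( $1 * $2 * $3 ))
}

expected_samples 99 30 1    # 2970
expected_samples 999 30 1   # 29970
```

Roughly 3,000 samples are already enough to rank hot functions. The tenfold increase mostly buys a larger perf.data file and, with DWARF call graphs, a stack copy for every one of those samples.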

Flame graphs

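A flame graph is just another way of reading the same profile. Each box's width is the share of samples in which that function was on the stack, so width is inclusive of everything the function called. In the default layout stacks grow upward: callers sit below their callees, and the top edge shows the function that was actually on-CPU when the sample fired. A wide box low in the graph is a common ancestor; a wide plateau along the top edge is where CPU time is really being spent.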

# Record first
sudo perf record -F 99 -g --call-graph fp -p 4242 -- sleep 30

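# Convert to folded stacks, then to SVG
# (stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repository and must be on PATH;
#  --color=java is a palette for JVM profiles, omit it otherwise)
sudo perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl --color=java out.folded > cpu-flame.svg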

Read it like this:

For service-wide context around latency and saturation, pair the profile with dashboards from Grafana Basics. A flame graph tells you where; time-series telemetry tells you when and how often.
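
The folded file from the conversion step is also useful on its own. Each line is a semicolon-separated stack followed by a sample count, so a few lines of awk can rank functions by self time without rendering anything. A minimal sketch; the helper name top_leaves and the sample stacks are made up for illustration.

```shell
# Hypothetical helper: sum samples by leaf frame from folded stacks on stdin.
# Each folded line looks like "frame1;frame2;...;leaf COUNT".
top_leaves() {
  awk '{
    count = $NF                      # the sample count is the last field
    stack = $0
    sub(/ [0-9]+$/, "", stack)       # drop the trailing count
    n = split(stack, frames, ";")
    self[frames[n]] += count         # credit the leaf (on-CPU) frame
  }
  END { for (f in self) print self[f], f }' | sort -k1,1nr
}

# Made-up folded stacks for illustration:
printf '%s\n' \
  'app;main;parse 30' \
  'app;main;parse;regex_match 50' \
  'app;main;compress 20' | top_leaves
```

Against a real capture the equivalent would be top_leaves < out.folded | head. The leaf totals correspond to the plateaus along the top edge of the flame graph.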

Useful bpftrace one-liners

Use bpftrace when you know the event you care about and want counts, histograms, or stacks attached to that event. Prefer tracepoints over kprobes when both exist: they are more stable across kernel versions.

Who is generating the most syscalls?

sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
interval:s:10 { print(@, 10); clear(@); }'

Who is opening files the most?

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }
interval:s:5 { print(@, 10); clear(@); }'

Sample user stacks for one hot PID

sudo bpftrace -e 'profile:hz:99 /pid == 4242/ { @[ustack] = count(); }'

Where are TCP retransmits coming from?

sudo bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @[kstack] = count(); }'

Histogram of block I/O latency

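sudo bpftrace -e '
// Key on device and sector: sector numbers alone collide across disks.
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}
// Drop in-flight entries so only the histogram is printed on exit.
END { clear(@start); }'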
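
When reading the @usecs histogram, keep in mind that hist() groups values into power-of-two buckets, so bucket boundaries are coarse at the high end. The sketch below shows which bucket a given latency lands in. It assumes non-negative values and prints plain numbers, whereas bpftrace abbreviates large bounds (1K, 64K and so on) and shows the first buckets as [0] and [1]. The helper name hist_bucket is hypothetical.

```shell
# Hypothetical helper: the power-of-two bucket a non-negative value falls into.
hist_bucket() {
  v=$1
  if [ "$v" -lt 1 ]; then
    echo "[0]"
    return
  fi
  lo=1
  # Double the lower bound until the next doubling would exceed the value.
  while [ $(( lo * 2 )) -le "$v" ]; do
    lo=$(( lo * 2 ))
  done
  hi=$(( lo * 2 ))
  echo "[$lo, $hi)"
}

hist_bucket 700     # [512, 1024)
hist_bucket 90000   # [65536, 131072)
```

A 90 ms I/O and a 130 ms I/O share one bucket, so use the histogram to spot the tail and a narrower trace to measure it.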

Which commands are calling connect()? (This counts every socket type, including UDP and Unix-domain sockets, not only TCP.)

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_connect { @[comm] = count(); }
interval:s:5 { print(@, 10); clear(@); }'
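
The one-liners above run system-wide until you press Ctrl-C. For production use it is usually safer to filter on one PID and let the program exit by itself. A minimal sketch that generates such a program; the helper name scoped_syscall_prog is hypothetical, and 4242 is the same placeholder PID used elsewhere on this page.

```shell
# Hypothetical helper: emit a bpftrace program that counts syscalls by
# syscall number for one PID and exits on its own after a fixed window.
# args: PID, capture length in seconds
scoped_syscall_prog() {
  printf 'tracepoint:raw_syscalls:sys_enter /pid == %d/ { @[args->id] = count(); }\n' "$1"
  printf 'interval:s:%d { exit(); }\n' "$2"
}

# Usage (requires root and bpftrace):
#   sudo bpftrace -e "$(scoped_syscall_prog 4242 10)"
scoped_syscall_prog 4242 10
```

bpftrace prints all maps when the program exits, so the counts appear once the window closes.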

Three habits keep these useful:

Interpreting the output

What you see | Usually means | Next step
perf top dominated by one application symbol | A real CPU hot path in your code | Take a perf record, inspect callers, then optimize or cache.
Heavy __schedule, futex_wait, or epoll frames | The process is mostly waiting | Look at lock contention, I/O wait, queue depth, or upstream latency.
Low IPC and high cache misses in perf stat | Stalls on memory or poor locality | Check object churn, data layout, NUMA placement, or oversized working sets.
bpftrace histogram with a long right tail | Most ops are fine but some are very slow | Find which queue, disk, or remote dependency produces the tail.
Retransmit stacks in network code | Packet loss or downstream congestion | Correlate with Wireshark & tshark or interface counters.

Do not read percentages in isolation. A function taking 20% of samples on an otherwise idle box may be harmless; the same 20% on a pinned core during an incident is your outage.
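
One way to put a percentage in context is to convert it to CPU-seconds: each sample stands for roughly 1/Hz seconds of on-CPU time. A minimal sketch with made-up sample counts; the helper name cpu_seconds is hypothetical.

```shell
# Hypothetical helper: CPU-seconds attributed to one symbol.
# args: total samples in the profile, the symbol's share in percent,
#       sampling rate in Hz
cpu_seconds() {
  awk -v total="$1" -v pct="$2" -v hz="$3" \
    'BEGIN { printf "%.2f\n", total * pct / 100 / hz }'
}

cpu_seconds 150 20 99    # mostly idle process over 30 s: 0.30
cpu_seconds 2970 20 99   # one core pinned for 30 s: 6.00
```

The same 20% is a third of a CPU-second in the first capture and six CPU-seconds in the second.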

Safety notes for production systems

These tools are low overhead, not zero overhead. Production-safe means "used deliberately": short windows, scoped targets, and frequencies that fit the question.

Record the context alongside every capture so that before/after comparisons stay meaningful:

uname -r
perf --version
bpftrace --version
date -Is
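
The context commands above can be wrapped so that the metadata always lands next to the capture. A minimal sketch: the helper name save_context, the key=value layout, and the file name in the usage note are all made up for illustration, and date -Is assumes GNU date.

```shell
# Hypothetical helper: write capture metadata to a file.
# args: output file, then the exact command line that was (or will be) run
save_context() {
  out=$1
  shift
  {
    echo "command=$*"
    echo "kernel=$(uname -r)"
    echo "captured_at=$(date -Is)"
    # Tool versions are recorded only when the tool is installed.
    command -v perf >/dev/null 2>&1 && echo "perf=$(perf --version)"
    command -v bpftrace >/dev/null 2>&1 && echo "bpftrace=$(bpftrace --version)"
    true   # keep the exit status clean when a tool is missing
  } > "$out"
}
```

A call such as save_context perf-4242.meta perf record -F 99 -g -p 4242 -- sleep 30 records the command, and the capture window is recoverable from captured_at plus the sleep duration.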

Troubleshooting

Symptom | Likely cause | Fix
perf_event_open ... Operation not permitted | kernel.perf_event_paranoid is too restrictive for the current user | Run with sudo, or lower kernel.perf_event_paranoid deliberately and only for as long as needed.
[unknown] or poor userland stacks | Missing symbols, stripped binaries, or no frame pointers | Install debuginfo and retry with --call-graph fp or dwarf.
bpftrace: BTF not found | The running kernel does not expose BTF metadata | Use a distro kernel with BTF enabled, or install the matching kernel headers/debuginfo and adjust the script.
kprobe attach fails | Kernel symbol changed or is not exported on this release | Prefer a tracepoint if available, or inspect /proc/kallsyms on the target kernel.
Java or Go stacks are incomplete | Runtime or build flags do not preserve frames | For JVMs use frame pointers where possible (-XX:+PreserveFramePointer); otherwise use DWARF and accept higher cost.
The profile says CPU is hot, but service latency says waiting | You sampled wakeups, scheduler paths, or a short burst | Cross-check with systemd & journalctl, Grafana Basics, and a second capture.