perf & bpftrace
- Start with the symptom: CPU saturation, scheduler latency, syscalls, I/O latency, or retransmits. The tool choice follows from the question.
- Prefer sampling first. perf stat, perf top, and a 30-second perf record are usually enough to tell you where to dig.
- Trace narrowly in production: one PID, one cgroup, one tracepoint, and a short capture window.
- Readable stacks depend on prerequisites: symbols, frame pointers, debuginfo, and for eBPF especially BTF.
- Kernel frames like __schedule or futex_wait often mean waiting, not burning CPU. Don't optimize the scheduler when the app is blocked.
- Always save the exact command, kernel version, and time window so you can compare before and after a change.
Sampling vs tracing
Both tools answer performance questions, but they work differently. perf samples what the CPU was doing at intervals or counts hardware/software events. bpftrace runs tiny eBPF programs when a chosen event fires. The fastest path is usually sample first, trace second.
| Question | Reach for | Why |
|---|---|---|
| What is burning CPU right now? | perf top | Live sampled view of hot symbols with low overhead. |
| Did this deploy increase instructions, cache misses, or context switches? | perf stat | Great for before/after baselines and quick sanity checks. |
| Which stacks consumed CPU over the last 30 seconds? | perf record + perf report | Creates a profile you can inspect later or turn into a flame graph. |
| Which process is flooding syscalls or hitting one kernel path? | bpftrace | Event-driven counters and histograms answer exact "who triggered this?" questions. |
| What is the distribution of latency? | bpftrace | Histograms are one of eBPF's strongest use cases. |
Prereqs, permissions, and BTF
Tool output quality is mostly about environment quality. If the kernel forbids performance events, symbols are stripped, or BTF is missing, you get permission errors and [unknown] frames instead of answers.
# RHEL 9 / CentOS Stream
sudo dnf install -y perf bpftrace bpftool kernel-devel
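# Debian
sudo apt update && sudo apt install -y linux-perf bpftrace bpftool linux-headers-$(uname -r)
# Ubuntu (perf and bpftool ship in the linux-tools packages)
sudo apt update && sudo apt install -y linux-tools-common linux-tools-$(uname -r) bpftrace linux-headers-$(uname -r)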
# See whether the running kernel exposes BTF
ls -l /sys/kernel/btf/vmlinux
sudo bpftool feature probe kernel | less
# Check perf restrictions
sysctl kernel.perf_event_paranoid
sysctl kernel.kptr_restrict
What you usually need:
- Permissions: root is the easy path. On newer kernels, CAP_PERFMON and CAP_BPF can replace full root for some use cases, but production runbooks still usually use sudo.
- Symbols and debuginfo: install debuginfo for your app and libc if you want human-readable userland stacks.
- Frame pointers: --call-graph fp is fast and reliable when your binaries preserve frame pointers. If not, use DWARF call graphs at higher overhead.
- BTF: modern bpftrace wants /sys/kernel/btf/vmlinux. Without it, many scripts fail or need kernel headers and manual type definitions.
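The checks above can be folded into a small read-only preflight. This is a minimal sketch, not a standard tool: the function names paranoid_meaning, btf_status, and preflight are hypothetical, and the level descriptions paraphrase the upstream kernel sysctl documentation. Some distributions add levels above 2, which the sketch reports as unknown.

```shell
#!/bin/sh
# Hypothetical helper: describe a kernel.perf_event_paranoid level.
# Levels are cumulative: each one adds to the restrictions below it.
paranoid_meaning() {
  case "$1" in
    -1) echo "no restrictions" ;;
    0)  echo "raw tracepoint access restricted for unprivileged users" ;;
    1)  echo "also CPU-wide event access restricted" ;;
    2)  echo "also kernel profiling restricted" ;;
    *)  echo "distro-specific or unknown level, expect to need root" ;;
  esac
}

# Hypothetical helper: report whether a BTF file is readable.
# Defaults to the running kernel's BTF blob.
btf_status() {
  if [ -r "${1:-/sys/kernel/btf/vmlinux}" ]; then
    echo "btf: present"
  else
    echo "btf: missing"
  fi
}

# Print a short report. Reads only; changes no settings.
preflight() {
  level=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo unknown)
  echo "perf_event_paranoid=$level ($(paranoid_meaning "$level"))"
  btf_status
}
```

Calling preflight before an incident, rather than during one, shows whether sudo and kernel headers will be needed on that host.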
perf stat
perf stat is the fastest baseline you can take. It does not tell you where time went, but it tells you whether the workload is instruction-heavy, memory-stalled, branch-mispredict-heavy, or switching all over the place.
# Whole-system baseline for 30 seconds (-a counts on all CPUs; without it only the sleep command is measured)
sudo perf stat -a -d -d -d -- sleep 30
# Attach to one PID and print counters every second
sudo perf stat -p 4242 -I 1000
# Focus on the counters most people actually compare
sudo perf stat -e cycles,instructions,cache-misses,branches,branch-misses,context-switches,cpu-migrations,page-faults \
-p 4242 -- sleep 15
Performance counter stats for process id '4242':
36,428,100,553 cycles
19,004,242,887 instructions # 0.52 insn per cycle
68,117,004 cache-misses # 7.84% of all cache refs
14,921,447 context-switches
132,884 cpu-migrations
1,942,201 page-faults
15.001777081 seconds time elapsed
- IPC (instructions per cycle): rough rule of thumb is that low IPC plus high cache misses often means stalls, not a compute bottleneck.
- Context switches: if they spike, check lock contention, thread oversubscription, or very small work units.
- CPU migrations: high migrations can hurt cache locality. Sometimes a scheduler or affinity issue, sometimes just noise.
- Page faults: a burst at startup is normal; sustained major faults under load are not.
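For before/after comparisons it can help to compute IPC mechanically instead of reading it off the screen. The sketch below parses the CSV that perf stat emits with -x, and assumes the counter value is field 1 and the event name is field 3, as described in the perf-stat man page. The helper name ipc_from_csv is hypothetical, and the layout shifts if you add -I (a timestamp column comes first).

```shell
#!/bin/sh
# Hypothetical helper: compute instructions per cycle from `perf stat -x,` CSV on stdin.
# Assumes field 1 is the counter value and field 3 is the event name.
ipc_from_csv() {
  awk -F, '
    $3 ~ /^cycles/       { cycles = $1 }
    $3 ~ /^instructions/ { insns  = $1 }
    END {
      if (cycles > 0) printf "%.2f\n", insns / cycles
      else print "n/a"   # counters missing or not supported
    }'
}
```

A possible invocation is sudo perf stat -x, -e cycles,instructions -p 4242 -- sleep 15 2>&1 | ipc_from_csv. The 2>&1 matters because perf stat writes its counters to stderr.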
perf top
perf top is the live view: "what is hot right now?" It is the fastest way to spot a bad regex, a compression routine melting CPU, or a lock-heavy kernel path.
# Whole-system hot symbols
sudo perf top
# One process, 99 Hz sample rate, with call graphs
sudo perf top -F 99 -p 4242 -g --call-graph fp
# Sort by command, shared object, and symbol
sudo perf top -p 4242 --sort comm,dso,symbol
If the hottest entries are user symbols in your app, you probably have a real CPU hot path. If the hottest entries are scheduler, futex, or idle transitions, the process may be waiting and merely getting sampled there when it wakes.
perf record and perf report
perf record captures a profile for later inspection. This is the default answer when perf top showed something interesting and you need a stable artifact, not a live screen.
# 30-second CPU profile of one process
sudo perf record -F 99 -g --call-graph fp -p 4242 -- sleep 30
# Inspect the profile in the terminal
sudo perf report --stdio --sort comm,dso,symbol
# Annotate one hot symbol down to the instruction level
sudo perf annotate --stdio -s sha256_block_data_order
For runtimes that do not preserve frame pointers, switch to DWARF call graphs and accept the extra cost:
sudo perf record -F 99 -g --call-graph dwarf,16384 -p 4242 -- sleep 30
-F 999 feels precise but is usually pointless and can become expensive on busy systems. Start at 49 or 99 Hz unless you already know you need finer resolution.
Flame graphs
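A flame graph is just another way of reading the same profile. Width represents inclusive samples: the function itself plus everything it called. In the default flamegraph.pl layout the root is at the bottom and the frames that were actually on CPU sit at the top edge, so a wide box low in the graph is where overall CPU time accumulates, even if the true leaf work is further up the stack.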
# Record first
sudo perf record -F 99 -g --call-graph fp -p 4242 -- sleep 30
# Convert to folded stacks, then to SVG
sudo perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl --colors=java out.folded > cpu-flame.svg
Read it like this:
- A wide box is broad inclusive cost in that function and everything it calls. A wide flat top edge means the function itself was on CPU.
- A very deep narrow tower often means one path is complex but not necessarily the biggest time sink.
- If the flame is mostly allocator, memcpy, GC, or crypto library frames, the app is paying for memory movement or runtime housekeeping rather than business logic.
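To make "inclusive" concrete, the folded file can be summarized directly. This sketch assumes the folded format written by stackcollapse-perf.pl (frames joined by semicolons, then a space and a sample count). The helper name folded_summary is hypothetical. Inclusive counts every stack a frame appears in once; self counts only the stacks where it is the leaf.

```shell
#!/bin/sh
# Hypothetical helper: per-frame inclusive and self sample counts from folded stacks.
# Input lines look like "main;parse;memcpy 42". Reads files given as arguments, or stdin.
folded_summary() {
  awk '
    {
      count = $NF                      # sample count is the last field
      stack = $0
      sub(/ [0-9]+$/, "", stack)       # strip the count, keep the frames
      n = split(stack, frames, ";")
      self[frames[n]] += count         # leaf frame was on CPU
      split("", seen)                  # reset per-stack dedupe (handles recursion)
      for (i = 1; i <= n; i++) {
        if (!(frames[i] in seen)) {
          incl[frames[i]] += count
          seen[frames[i]] = 1
        }
      }
    }
    END {
      for (f in incl) printf "%s incl=%d self=%d\n", f, incl[f], self[f]
    }
  ' "$@" | LC_ALL=C sort
}
```

Running folded_summary out.folded on the file produced above gives the same numbers the SVG encodes as box widths (incl) and exposed top edges (self).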
For service-wide context around latency and saturation, pair the profile with dashboards from Grafana Basics. A flame graph tells you where; time-series telemetry tells you when and how often.
Useful bpftrace one-liners
Use bpftrace when you know the event you care about and want counts, histograms, or stacks attached to that event. Prefer tracepoints over kprobes when both exist: they are more stable across kernel versions.
Who is generating the most syscalls?
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
interval:s:10 { print(@, 10); clear(@); }'
Who is opening files the most?
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }
interval:s:5 { print(@, 10); clear(@); }'
Sample user stacks for one hot PID
sudo bpftrace -e 'profile:hz:99 /pid == 4242/ { @[ustack] = count(); }'
Where are TCP retransmits coming from?
sudo bpftrace -e 'kprobe:tcp_retransmit_skb { @[kstack] = count(); }'
Histogram of block I/O latency
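sudo bpftrace -e '
// Key by device and sector so requests on different disks do not collide
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}'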
Which commands are calling connect()? (counts TCP, UDP, and Unix-socket connects alike)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_connect { @[comm] = count(); }
interval:s:5 { print(@, 10); clear(@); }'
Three habits keep these useful:
- Time-box them. Most one-liners should run for 10-60 seconds, not all afternoon.
- Use predicates like /pid == 4242/ whenever possible.
- Expect kernels to differ. A kprobe name that exists on one release may be renamed on another. Tracepoints are steadier.
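One way to enforce the time box is to let the program end itself. The sketch below appends an interval probe that calls exit(), which are both standard bpftrace features. The wrapper name timeboxed is hypothetical and only builds the program text; it does not run anything.

```shell
#!/bin/sh
# Hypothetical helper: append a self-terminating timer to a bpftrace program.
# $1 = seconds to run, $2 = program text. Prints the combined program.
timeboxed() {
  secs=$1
  prog=$2
  printf '%s interval:s:%d { exit(); }\n' "$prog" "$secs"
}
```

For example, sudo bpftrace -e "$(timeboxed 30 'tracepoint:raw_syscalls:sys_enter /pid == 4242/ { @[comm] = count(); }')" stops after 30 seconds and prints the map on exit, so a forgotten terminal does not leave a probe attached.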
Interpreting the output
| What you see | Usually means | Next step |
|---|---|---|
| perf top dominated by one application symbol | A real CPU hot path in your code | Take a perf record, inspect callers, then optimize or cache. |
| Heavy __schedule, futex_wait, or epoll frames | The process is mostly waiting | Look at lock contention, I/O wait, queue depth, or upstream latency. |
| Low IPC and high cache misses in perf stat | Stalls on memory or poor locality | Check object churn, data layout, NUMA placement, or oversized working sets. |
| bpftrace histogram with a long right tail | Most ops are fine but some are very slow | Find which queue, disk, or remote dependency produces the tail. |
| Retransmit stacks in network code | Packet loss or downstream congestion | Correlate with Wireshark & tshark or interface counters. |
Do not read percentages in isolation. A function taking 20% of samples on an otherwise idle box may be harmless; the same 20% on a pinned core during an incident is your outage.
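A quick way to put a percentage in context is to convert samples to approximate CPU time: at F Hz, each sample stands for roughly 1/F seconds on CPU. The helper name cpu_seconds and the sample counts below are illustrative assumptions, not output from a real capture.

```shell
#!/bin/sh
# Hypothetical helper: convert a sample count to approximate on-CPU seconds.
# $1 = number of samples, $2 = sampling frequency in Hz.
cpu_seconds() {
  awk -v n="$1" -v hz="$2" 'BEGIN { printf "%.2f\n", n / hz }'
}

# 20% of a 100-sample profile (a mostly idle process)
cpu_seconds 20 99     # prints 0.20
# 20% of a 2970-sample profile (one core pinned for 30 s at 99 Hz)
cpu_seconds 594 99    # prints 6.00
```

The same 20% is a fifth of a second in the first case and six seconds of CPU in the second, which is why the total sample count belongs next to any percentage you report.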
Safety notes for production systems
- Prefer perf stat and perf top before custom eBPF. They answer many questions with less setup.
- Profile one process or cgroup instead of the whole machine when the incident is isolated.
- Avoid wildcard probes on production kernels. kprobe:* is how you turn observability into a new problem.
- Capture metadata with the profile: uname -r, perf --version, app build ID, and the dashboard time range.
- If you are debugging a latency incident, record a short profile during the bad window and another after recovery. Comparison is more valuable than absolute numbers.
uname -r
perf --version
bpftrace --version
date -Is
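The same commands can be written to a file that travels with the profile. This is a sketch under assumptions: the function name capture_meta and the file name capture-meta.txt are hypothetical, and missing tools are recorded rather than treated as errors.

```shell
#!/bin/sh
# Hypothetical helper: write capture metadata next to a profile.
# $1 = output directory (defaults to the current directory).
capture_meta() {
  dir=${1:-.}
  {
    echo "kernel=$(uname -r)"
    echo "captured_at=$(date -Is)"
    if command -v perf >/dev/null 2>&1; then
      echo "perf=$(perf --version 2>/dev/null)"
    else
      echo "perf=not installed"
    fi
    if command -v bpftrace >/dev/null 2>&1; then
      echo "bpftrace=$(bpftrace --version 2>/dev/null)"
    else
      echo "bpftrace=not installed"
    fi
  } > "$dir/capture-meta.txt"
}
```

Calling capture_meta in the directory that holds perf.data keeps the kernel version and timestamp attached to the artifact you later compare against.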
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| perf_event_open ... Operation not permitted | Kernel restrictions too high for the current user | Run with sudo, or lower kernel.perf_event_paranoid in a controlled manner. |
| [unknown] or poor userland stacks | Missing symbols, stripped binaries, or no frame pointers | Install debuginfo and retry with --call-graph fp or dwarf. |
| bpftrace: BTF not found | The running kernel does not expose BTF metadata | Use a distro kernel with BTF enabled, or install the matching kernel headers/debuginfo and adjust the script. |
| kprobe attach fails | Kernel symbol changed or is not exported on this release | Prefer a tracepoint if available, or inspect /proc/kallsyms on the target kernel. |
| Java or Go stacks are incomplete | Runtime or build flags do not preserve frames | For JVMs, enable frame pointers (-XX:+PreserveFramePointer) where possible; otherwise use DWARF and accept higher cost. |
| The profile says CPU is hot, but service latency says waiting | You sampled wakeups, scheduler paths, or a short burst | Cross-check with systemd & journalctl, Grafana Basics, and a second capture. |
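For the kprobe row, a symbol lookup before attaching can save a failed run. The sketch assumes the usual /proc/kallsyms layout (address, type, symbol name) and uses a hypothetical helper name, has_ksym. A listed symbol is necessary but not sufficient: inlined or blacklisted functions can still refuse a probe.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a symbol is listed in a kallsyms-format file.
# $1 = symbol name, $2 = file to search (defaults to /proc/kallsyms).
has_ksym() {
  awk -v sym="$1" '$3 == sym { found = 1 } END { exit !found }' "${2:-/proc/kallsyms}"
}
```

A typical guard is has_ksym tcp_retransmit_skb || echo "symbol missing on $(uname -r)". To see what bpftrace itself can attach to, sudo bpftrace -l 'kprobe:tcp_retransmit*' lists the matching probes.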