perf & bpftrace

Use perf for low-overhead sampling and bpftrace for event-driven answers. This page covers the commands, prerequisites, and the mistakes that waste the most time.

If you only remember six things

Start with the symptom: CPU saturation, scheduler latency, syscalls, I/O latency, or retransmits. The tool choice follows from the question.
Prefer sampling first. perf stat, perf top, and a 30-second perf record are usually enough to tell you where to dig.
Trace narrowly in production: one PID, one cgroup, one tracepoint, and a short capture window.
Readable stacks depend on prerequisites: symbols, frame pointers, debuginfo, and for eBPF especially BTF.
Kernel frames like __schedule or futex_wait often mean waiting, not burning CPU. Don't optimize the scheduler when the app is blocked.
Always save the exact command, kernel version, and time window so you can compare before and after a change.

On this page

Sampling vs tracing
Prereqs, permissions, and BTF
perf stat
perf top
perf record and perf report
Flame graphs
Useful bpftrace one-liners
Interpreting the output
Safety notes for production systems
Troubleshooting

Sampling vs tracing

Both tools answer performance questions, but they work differently. perf samples what the CPU was doing at intervals or counts hardware/software events. bpftrace runs tiny eBPF programs when a chosen event fires. The fastest path is usually sample first, trace second.

Question	Reach for	Why
What is burning CPU right now?	`perf top`	Live sampled view of hot symbols with low overhead.
Did this deploy increase instructions, cache misses, or context switches?	`perf stat`	Great for before/after baselines and quick sanity checks.
Which stacks consumed CPU over the last 30 seconds?	`perf record` + `perf report`	Creates a profile you can inspect later or turn into a flame graph.
Which process is flooding syscalls or hitting one kernel path?	`bpftrace`	Event-driven counters and histograms answer exact "who triggered this?" questions.
What is the distribution of latency?	`bpftrace`	Histograms are one of eBPF's strongest use cases.

Related: Start with lsof & strace when you need exact syscall arguments, and with Observability Overview when you need to decide whether the problem is better seen as metrics, logs, or traces.

Prereqs, permissions, and BTF

Tool output quality is mostly about environment quality. If the kernel forbids performance events, symbols are stripped, or BTF is missing, you get permission errors and [unknown] frames instead of answers.

# RHEL 9 / CentOS Stream
sudo dnf install -y perf bpftrace bpftool kernel-devel

# Debian (on Ubuntu, perf ships in linux-tools-$(uname -r) rather than linux-perf)
sudo apt update && sudo apt install -y linux-perf bpftrace bpftool linux-headers-$(uname -r)

# See whether the running kernel exposes BTF
ls -l /sys/kernel/btf/vmlinux
sudo bpftool feature probe kernel | less

# Check perf restrictions
sysctl kernel.perf_event_paranoid
sysctl kernel.kptr_restrict

What you usually need:

Permissions: root is the easy path. On Linux 5.8 and later, CAP_PERFMON and CAP_BPF can replace full root for some use cases, but production runbooks still usually use sudo.
Symbols and debuginfo: install debuginfo for your app and libc if you want human-readable userland stacks.
Frame pointers: --call-graph fp is fast and reliable when your binaries preserve frame pointers. If not, use DWARF call graphs at higher overhead.
BTF: modern bpftrace wants /sys/kernel/btf/vmlinux. Without it, many scripts fail or need kernel headers and manual type definitions.

BTF in one sentence: it is compact kernel type metadata. With BTF present, eBPF tools can resolve structs and tracepoints without guessing at kernel internals. On RHEL-family systems this usually "just works" on supported kernels; on custom kernels it is often the missing piece.
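The checks above can be folded into a small preflight function. This is a minimal sketch, not an official tool: explain_paranoid and preflight are hypothetical names, and the level descriptions paraphrase the kernel sysctl documentation for users without CAP_PERFMON. Some distribution kernels define stricter levels above 2; the sketch treats them like 2.

```shell
# Hypothetical helper: describe a kernel.perf_event_paranoid level for
# users without CAP_PERFMON.
explain_paranoid() {
  local level="$1"
  case "$level" in
    ''|*[!0-9-]*) echo "unknown"; return 0 ;;
  esac
  if   [ "$level" -ge 2 ]; then echo "user-space profiling only"
  elif [ "$level" -ge 1 ]; then echo "kernel profiling allowed, CPU-wide events blocked"
  elif [ "$level" -ge 0 ]; then echo "CPU-wide events allowed, raw tracepoints blocked"
  else                          echo "almost no restrictions"
  fi
}

# Hypothetical helper: report BTF availability and the current perf restriction.
preflight() {
  local btf=/sys/kernel/btf/vmlinux paranoid
  if [ -r "$btf" ]; then echo "btf=present"; else echo "btf=missing"; fi
  if paranoid=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null); then
    echo "perf_event_paranoid=$paranoid ($(explain_paranoid "$paranoid"))"
  else
    echo "perf_event_paranoid=unreadable"
  fi
}

preflight
```

Running it before an incident, rather than during one, tells you whether sudo or a sysctl change will be needed.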

perf stat

perf stat is the fastest baseline you can take. It does not tell you where time went, but it tells you whether the workload is instruction-heavy, memory-stalled, branch-mispredict-heavy, or switching all over the place.

# Whole-system baseline for 30 seconds (-a counts on all CPUs; without it perf measures only the sleep command)
sudo perf stat -a -d -d -d -- sleep 30

# Attach to one PID and print counters every second
sudo perf stat -p 4242 -I 1000

# Focus on the counters most people actually compare
# (cache-references is what lets perf print the cache-miss percentage)
sudo perf stat -e cycles,instructions,cache-references,cache-misses,branches,branch-misses,context-switches,cpu-migrations,page-faults \
  -p 4242 -- sleep 15

 Performance counter stats for process id '4242':

      36,428,100,553      cycles
      19,004,242,887      instructions              #    0.52  insn per cycle
          68,117,004      cache-misses              #    7.84% of all cache refs
          14,921,447      context-switches
             132,884      cpu-migrations
           1,942,201      page-faults

      15.001777081 seconds time elapsed

IPC (instructions per cycle): as a rough rule of thumb, low IPC combined with a high cache-miss rate usually points to memory stalls rather than a compute bottleneck. An IPC well below 1, like the 0.52 above, is a typical sign.
Context switches: if they spike, check lock contention, thread oversubscription, or very small work units.
CPU migrations: high migrations can hurt cache locality. Sometimes a scheduler or affinity issue, sometimes just noise.
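Page faults: a burst at startup is normal; sustained major faults under load are not. The page-faults counter mixes minor and major faults, so add major-faults to the -e list when you need to tell them apart.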
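The derived numbers are easy to recompute when comparing two runs. The sketch below parses three counter lines copied from the sample output above and prints IPC and context switches per second. perf_ratios is a hypothetical helper name, and the 15-second window is taken from the sleep 15 in the command.

```shell
# Counters copied from the sample perf stat output (abbreviated).
sample='36,428,100,553 cycles
19,004,242,887 instructions
14,921,447 context-switches'

# Hypothetical helper: read "<count> <event>" lines on stdin and print
# derived ratios. Argument: capture length in seconds.
perf_ratios() {
  local seconds="$1"
  awk -v secs="$seconds" '
    { gsub(/,/, "", $1); v[$2] = $1 }   # strip thousands separators
    END {
      if (v["cycles"] > 0) printf "ipc=%.2f\n", v["instructions"] / v["cycles"]
      if (secs > 0) printf "cs_per_sec=%.0f\n", v["context-switches"] / secs
    }'
}

printf '%s\n' "$sample" | perf_ratios 15
```

For the sample data this prints ipc=0.52 and cs_per_sec=994763. Close to a million context switches per second is the kind of figure that justifies checking for lock contention or oversubscription.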

perf top

perf top is the live view: "what is hot right now?" It is the fastest way to spot a bad regex, a compression routine melting CPU, or a lock-heavy kernel path.

# Whole-system hot symbols
sudo perf top

# One process, 99 Hz sample rate, with call graphs
sudo perf top -F 99 -p 4242 -g --call-graph fp

# Sort by command, shared object, and symbol
sudo perf top -p 4242 --sort comm,dso,symbol

If the hottest entries are user symbols in your app, you probably have a real CPU hot path. If the hottest entries are scheduler, futex, or idle transitions, the process may be waiting and merely getting sampled there when it wakes.

perf record and perf report

perf record captures a profile for later inspection. This is the default answer when perf top showed something interesting and you need a stable artifact, not a live screen.

# 30-second CPU profile of one process
sudo perf record -F 99 -g --call-graph fp -p 4242 -- sleep 30

# Inspect the profile in the terminal
sudo perf report --stdio --sort comm,dso,symbol

# Annotate one hot symbol down to the instruction level
sudo perf annotate --stdio -s sha256_block_data_order

For runtimes that do not preserve frame pointers, switch to DWARF call graphs and accept the extra cost:

sudo perf record -F 99 -g --call-graph dwarf,16384 -p 4242 -- sleep 30

Sampling frequency matters. -F 999 feels precise but rarely changes the conclusion and can become expensive on busy systems. Start at 49 or 99 Hz unless you already know you need finer resolution; the odd rates avoid sampling in lockstep with periodic timers.
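A quick way to reason about the rate is to estimate how many samples a capture can produce. The sketch below assumes perf only takes CPU samples while a thread is running, so the result is an upper bound: a fully busy CPU yields about one sample per tick, an idle one yields almost none. expected_samples is a hypothetical helper name.

```shell
# Hypothetical helper: upper bound on samples for a capture.
# Arguments: sample rate in Hz, capture length in seconds, number of busy CPUs.
expected_samples() {
  local hz="$1" seconds="$2" busy_cpus="$3"
  echo $(( hz * seconds * busy_cpus ))
}

expected_samples 99 30 1     # one fully busy thread for 30 seconds
expected_samples 999 30 16   # sixteen busy CPUs at 999 Hz
```

The first call prints 2970, which is already enough to resolve anything above a fraction of a percent. The second prints 479520: roughly 160 times the data and interrupt load for the same 30 seconds.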

Flame graphs

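A flame graph is just another way of reading the same profile. Width represents inclusive samples: the function plus everything it called. In the default flamegraph.pl layout the root of each stack sits at the bottom and callees stack upward, so wide boxes near the bottom show where overall CPU time accumulates, while the frames on the top edge are the code that was actually on-CPU.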

# Record first
sudo perf record -F 99 -g --call-graph fp -p 4242 -- sleep 30

# Convert to folded stacks, then to SVG
sudo perf script > out.perf
stackcollapse-perf.pl out.perf > out.folded
flamegraph.pl --color=java out.folded > cpu-flame.svg

Read it like this:

A wide box is broad inclusive cost: that function plus everything stacked above it. A wide plateau on the top edge is self time in that function.
A very deep narrow tower often means one path is complex but not necessarily the biggest time sink.
If the flame is mostly allocator, memcpy, GC, or crypto library frames, the app is paying for memory movement or runtime housekeeping rather than business logic.
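The inclusive versus self distinction can be checked directly on the folded file, which holds one "frame;frame;frame count" line per unique stack. The sketch below uses made-up stacks and a hypothetical frame_costs helper; with a real capture you would feed it out.folded instead.

```shell
# Hypothetical folded stacks: root first, leaf last, then the sample count.
folded='main;handle_request;parse_json 40
main;handle_request;compress 35
main;handle_request 5
main;gc_sweep 20'

# Hypothetical helper: print inclusive and self sample counts per function.
frame_costs() {
  awk '
    {
      count = $NF
      stack = $0
      sub(/ [0-9]+$/, "", stack)          # drop the trailing count
      n = split(stack, frames, ";")
      split("", seen)                     # reset per-stack dedup set
      for (i = 1; i <= n; i++) {
        if (!(frames[i] in seen)) {       # count recursion only once
          incl[frames[i]] += count
          seen[frames[i]] = 1
        }
      }
      self[frames[n]] += count            # leaf frame was on-CPU
    }
    END { for (f in incl) printf "%s inclusive=%d self=%d\n", f, incl[f], self[f] }
  ' | sort
}

printf '%s\n' "$folded" | frame_costs
```

Here main has inclusive=100 and self=0, while handle_request has inclusive=80 and self=5: both are wide boxes, but almost all of the work sits in the callees stacked above them.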

For service-wide context around latency and saturation, pair the profile with dashboards from Grafana Basics. A flame graph tells you where; time-series telemetry tells you when and how often.

Useful bpftrace one-liners

Use bpftrace when you know the event you care about and want counts, histograms, or stacks attached to that event. Prefer tracepoints over kprobes when both exist: they are more stable across kernel versions.

Who is generating the most syscalls?

sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }
interval:s:10 { print(@, 10); clear(@); }'

Who is opening files the most?

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }
interval:s:5 { print(@, 10); clear(@); }'

Sample user stacks for one hot PID

sudo bpftrace -e 'profile:hz:99 /pid == 4242/ { @[ustack] = count(); }'

Where are TCP retransmits coming from?

sudo bpftrace -e 'kprobe:tcp_retransmit_skb { @[kstack] = count(); }'

Histogram of block I/O latency

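sudo bpftrace -e '
// Key on device and sector: sector numbers alone can collide across disks.
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}'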
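bpftrace's hist() groups values into power-of-two buckets, which is why the tail of a latency histogram looks coarse. The sketch below reproduces the bucket boundaries for a non-negative value so you can see where a given latency would land. hist_bucket is a hypothetical helper, and it prints raw numbers where bpftrace abbreviates large ones (for example 1K).

```shell
# Hypothetical helper: print the power-of-two bucket that hist() would use
# for a non-negative integer value.
hist_bucket() {
  local v="$1" lo=1 hi
  if [ "$v" -lt 2 ]; then
    echo "[$v]"                 # 0 and 1 get their own buckets
    return 0
  fi
  while [ $(( lo * 2 )) -le "$v" ]; do
    lo=$(( lo * 2 ))
  done
  hi=$(( lo * 2 ))
  echo "[$lo, $hi)"
}

hist_bucket 1500    # a 1.5 ms I/O measured in microseconds
hist_bucket 4
```

A 1500 µs completion lands in [1024, 2048), so two I/Os in the same bucket can differ by almost a factor of two. Treat the histogram as a shape, not a precise percentile.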

Which commands are calling connect()? (all socket types, not only TCP)

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_connect { @[comm] = count(); }
interval:s:5 { print(@, 10); clear(@); }'

Three habits keep these useful:

Time-box them. Most one-liners should run for 10-60 seconds, not all afternoon.
Use predicates like /pid == 4242/ whenever possible.
Expect kernels to differ. A kprobe name that exists on one release may be renamed on another. Tracepoints are steadier.
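The first two habits can be built into the program itself: a predicate narrows the probe, and a one-shot interval probe calling exit() ends the trace, at which point bpftrace prints its maps. The sketch below only assembles the program text. with_deadline is a hypothetical helper and PID 4242 is the same placeholder used elsewhere on this page.

```shell
# Hypothetical helper: append a timer probe that stops the trace after N seconds.
with_deadline() {
  local seconds="$1" program="$2"
  printf '%s\ninterval:s:%d { exit(); }\n' "$program" "$seconds"
}

prog=$(with_deadline 30 'tracepoint:raw_syscalls:sys_enter /pid == 4242/ { @[comm] = count(); }')
printf '%s\n' "$prog"

# To run it (needs root and bpftrace):
#   sudo bpftrace -e "$prog"
```

Keeping the deadline inside the program means the trace still stops if the terminal session is lost.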

Interpreting the output

What you see	Usually means	Next step
`perf top` dominated by one application symbol	A real CPU hot path in your code	Take a `perf record`, inspect callers, then optimize or cache.
Heavy `__schedule`, `futex_wait`, or epoll frames	The process is mostly waiting	Look at lock contention, I/O wait, queue depth, or upstream latency.
Low IPC and high cache misses in `perf stat`	Stalls on memory or poor locality	Check object churn, data layout, NUMA placement, or oversized working sets.
bpftrace histogram with a long right tail	Most ops are fine but some are very slow	Find which queue, disk, or remote dependency produces the tail.
Retransmit stacks in network code	Packet loss or downstream congestion	Correlate with Wireshark & tshark or interface counters.

Do not read percentages in isolation. A function taking 20% of samples on an otherwise idle box may be harmless; the same 20% on a pinned core during an incident is your outage.
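One way to make a percentage concrete is to convert it to CPU time: each sample at F Hz stands for roughly 1/F seconds on-CPU. The sketch below is an approximation under that assumption. cpu_seconds is a hypothetical helper, and the sample totals are illustrative rather than measured.

```shell
# Hypothetical helper: CPU-seconds represented by a share of the samples.
# Arguments: percent of samples, total samples in the profile, sample rate in Hz.
cpu_seconds() {
  local percent="$1" total_samples="$2" hz="$3"
  awk -v p="$percent" -v n="$total_samples" -v hz="$hz" \
    'BEGIN { printf "%.1f\n", p / 100 * n / hz }'
}

cpu_seconds 20 60 99      # mostly idle box: only 60 samples were collected
cpu_seconds 20 2970 99    # pinned core: 2970 samples in a 30 s capture
```

The same 20% is about 0.1 CPU-seconds in the first case and 6.0 in the second, which is the difference between noise and an incident.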

Safety notes for production systems

These tools are low overhead, not zero overhead. Production-safe means "used deliberately": short windows, scoped targets, and frequencies that fit the question.

Prefer perf stat and perf top before custom eBPF. They answer many questions with less setup.
Profile one process or cgroup instead of the whole machine when the incident is isolated.
Avoid wildcard probes on production kernels. kprobe:* is how you turn observability into a new problem.
Capture metadata with the profile: uname -r, perf --version, app build ID, and the dashboard time range.
If you are debugging a latency incident, record a short profile during the bad window and another after recovery. Comparison is more valuable than absolute numbers.

uname -r
perf --version
bpftrace --version
date -Is
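To keep that metadata attached to a capture, the same commands can write a small key=value file next to the profile. This is a sketch: capture_meta and the file layout are hypothetical, and tools that are not installed are recorded as such rather than failing the script.

```shell
# Hypothetical helper: write capture metadata to the given file.
capture_meta() {
  local out="$1" tool
  {
    echo "kernel=$(uname -r)"
    echo "captured_at=$(date -Is)"
    for tool in perf bpftrace; do
      if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool=$("$tool" --version 2>&1 | head -n 1)"
      else
        echo "$tool=not-installed"
      fi
    done
  } > "$out"
}

meta_file=$(mktemp)
capture_meta "$meta_file"
cat "$meta_file"
```

Storing one such file per capture makes the "during" and "after recovery" profiles comparable weeks later.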

Troubleshooting

Symptom	Likely cause	Fix
`perf_event_open ... Operation not permitted`	Kernel restrictions too high for the current user	Run with `sudo`, or lower `kernel.perf_event_paranoid` in a controlled manner.
`[unknown]` or poor userland stacks	Missing symbols, stripped binaries, or no frame pointers	Install debuginfo and retry with `--call-graph fp` or `dwarf`.
`bpftrace: BTF not found`	The running kernel does not expose BTF metadata	Use a distro kernel with BTF enabled, or install the matching kernel headers/debuginfo and adjust the script.
kprobe attach fails	Kernel symbol changed or is not exported on this release	Prefer a tracepoint if available, or inspect `/proc/kallsyms` on the target kernel.
Java or Go stacks are incomplete	Runtime or build flags do not preserve frames	For JVMs, run with `-XX:+PreserveFramePointer` where possible; otherwise use DWARF and accept higher cost.
The profile says CPU is hot, but service latency says waiting	You sampled wakeups, scheduler paths, or a short burst	Cross-check with systemd & journalctl, Grafana Basics, and a second capture.