Troubleshooting Workflow

Page 17 — A repeatable process for diagnosing Linux service problems.

The short version

check status
check logs
check config
check network
check auth/certs/time
change one thing at a time

Step 0: systemctl list-units --failed

Before anything else, ask systemd what it thinks is broken. One command, no grepping, no assumptions: if a unit has failed, it is listed here with its load, active, and sub state.

systemctl list-units --failed
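# Same list without the pager (useful in scripts)
systemctl --failed --no-pager
# For the failure cause (exit-code, signal, timeout, oom-kill, ...) query the unit's Result
systemctl show -p Result UNIT

# If you just rebooted, see what held up boot and what logged errors
systemd-analyze critical-chain   # time-critical chain of units during startup
journalctl -p err -b             # messages of priority err or worse since last boot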

This is the fastest "what's actually broken on this host?" command. It catches timers, mounts, sockets, and one-shot units that otherwise stay invisible. If nothing is listed, the problem either lies outside systemd's view or lives inside a unit that systemd considers healthy (e.g. nginx running but serving 502s). Continue to Step 1.
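A minimal sketch of turning the failed-unit list into a per-unit cause report. The function names failed_unit_names and report_failed are hypothetical, and the sketch assumes a systemd recent enough to support --plain, --no-legend, and show --value.

```shell
# Extract unit names from `systemctl list-units --failed --plain --no-legend`
# output on stdin (first whitespace-separated column of each non-empty row).
failed_unit_names() {
    awk 'NF > 0 { print $1 }'
}

# Print each failed unit next to the Result systemd recorded for it
# (exit-code, signal, timeout, oom-kill, ...).
report_failed() {
    systemctl list-units --failed --plain --no-legend |
        failed_unit_names |
        while read -r unit; do
            printf '%s\t%s\n' "$unit" "$(systemctl show -p Result --value "$unit")"
        done
}
```

Calling report_failed on a host with failed units should print one line per unit; an empty result means systemd has nothing marked as failed.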

Step 1: Define the symptom

Not "it's broken." Say exactly what is wrong:

nginx won't start
website returns 502 Bad Gateway
SSH key auth fails
mail is stuck in the queue
host cannot resolve DNS
cert appears expired
log forwarding stopped

Write it down. A clear description of the symptom prevents you from chasing the wrong problem. It also helps when asking someone else for help.

Step 2: Check service state

systemctl status SERVICE
journalctl -u SERVICE -n 50
journalctl -u SERVICE --since today

Why: if the service is dead, fix that first. Logs often tell you exactly why startup failed — look for failed, error, or the last line before it stopped.
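The "look for failed, error, or the last line" advice can be sketched as a small filter. log_hints is a hypothetical helper name; it only reads text on stdin, so it works on journal output or on a plain log file.

```shell
# Read log text on stdin; print every line mentioning "fail" or "error"
# (case-insensitive), then the final line of the log.
log_hints() {
    awk '
        { last = $0 }
        tolower($0) ~ /fail|error/ { print "match: " $0 }
        END { if (NR > 0) print "last:  " last }
    '
}
# Typical use (SERVICE is a placeholder):
#   journalctl -u SERVICE -b --no-pager -o cat | log_hints
```

The -o cat option prints only the message text, which keeps the matches free of timestamps and hostnames.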

Step 3: Validate config

Use the service's built-in config validation tool:

nginx -t
apachectl configtest
postfix check
doveconf

Why: many failures are just bad config syntax or invalid values. Fixing a syntax error is faster than debugging a running service.
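One way to make validation a habit is to refuse a reload unless the check passes. The sketch below assumes the systemd unit names nginx, httpd/apache2, postfix, and dovecot; validator_for and safe_reload are hypothetical helper names, and doveconf -n is used here because it prints only non-default settings.

```shell
# Map a systemd unit name to the config validation command for that service.
validator_for() {
    case "$1" in
        nginx)          echo "nginx -t" ;;
        httpd|apache2)  echo "apachectl configtest" ;;
        postfix)        echo "postfix check" ;;
        dovecot)        echo "doveconf -n" ;;
        *)              return 1 ;;
    esac
}

# Reload only when validation succeeds; otherwise leave the running service alone.
safe_reload() {
    check=$(validator_for "$1") || { echo "no validator known for $1" >&2; return 2; }
    if $check; then
        systemctl reload "$1"
    else
        echo "config check failed for $1; not reloading" >&2
        return 1
    fi
}
```

With this in place, safe_reload nginx behaves like nginx -t && systemctl reload nginx, and a broken config never reaches the running service.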

Step 4: Check network and ports

ss -tulpn
nc -zv host port
curl -vk https://host
ip a
ip r

Questions to ask:

Is the service listening on the expected port? (ss -tulpn)
Is the port blocked by a firewall? (firewall-cmd --list-all)
Can you reach the upstream or backend from this host?
Is the routing correct?
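The first question can be answered mechanically. This is a sketch that assumes the column layout of ss -H -tln (no header, Local Address:Port in the fourth column); listening_on is a hypothetical helper name.

```shell
# Read `ss -H -tln` output on stdin and succeed if some socket
# listens on the given TCP port (column 4 is Local Address:Port).
listening_on() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
# Typical use:
#   ss -H -tln | listening_on 443 && echo "something listens on 443"
```

Anchoring the match at the end of the column keeps port 80 from matching 8080. A listener bound to 127.0.0.1 still counts, so check the address as well when remote clients cannot connect.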

Step 5: Check DNS

dig host
host host
nslookup host

Questions to ask:

Does the hostname resolve to the expected IP?
Is the DNS server configured correctly (/etc/resolv.conf)?
Is there a reverse DNS entry if needed?
Is a search domain causing unexpected name resolution?
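A sketch for the first and last questions. dig queries DNS directly, while getent goes through the system resolver (nsswitch, /etc/hosts, search domains), so comparing the two can expose a search-domain or hosts-file surprise. resolves_to and system_lookup are hypothetical helper names.

```shell
# Succeed if the expected address appears in resolver output on stdin
# (one answer per line, as printed by `dig +short NAME`).
resolves_to() {
    grep -Fxq -- "$1"
}

# Print the addresses the system resolver returns for a name,
# which is what most applications on this host actually see.
system_lookup() {
    getent ahosts "$1" | awk '{ print $1 }' | sort -u
}
# Typical use (name and address are placeholders):
#   dig +short www.example.com | resolves_to 192.0.2.10 || echo "DNS answer differs"
#   system_lookup www.example.com
```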

Step 6: Check auth, certs, and time

kinit
klist
openssl x509 -in cert.crt -noout -dates
chronyc tracking
timedatectl

Things to verify:

Kerberos ticket is valid and not expired
Certificate is within validity dates
Certificate CN / SAN matches the hostname being connected to
System time is in sync: by default Kerberos rejects requests when client and server clocks differ by more than 5 minutes (clockskew = 300 seconds)
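The time and certificate checks can be scripted as pass/fail tests. This sketch assumes the "System time : N seconds fast/slow of NTP time" line that chronyc tracking prints; skew_ok and cert_valid_for_days are hypothetical helper names, and 300 seconds is the default Kerberos tolerance.

```shell
# Read `chronyc tracking` output on stdin and succeed if the system-time
# offset is below the limit in seconds (default 300).
skew_ok() {
    awk -v max="${1:-300}" '
        /^System time/ { offset = $4 + 0; seen = 1 }
        END { exit (seen && offset < max) ? 0 : 1 }
    '
}

# Succeed if the certificate file is still valid N days from now.
cert_valid_for_days() {
    openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))"
}
# Typical use (cert.crt is a placeholder path):
#   chronyc tracking | skew_ok || echo "clock offset too large for Kerberos"
#   cert_valid_for_days cert.crt 30 || echo "certificate expires within 30 days"
```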

Step 7: Check permissions and files

ls -l
namei -l /path/to/file

Common permission issues:

Config file not readable by the service user
Private key permissions too open (SSH / TLS services refuse these)
Log directory not writable by the service
SELinux label wrong: ls -lZ, ausearch -m avc
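A sketch for the private-key case, assuming GNU stat (standard on Linux) and treating modes 600 and 400 as acceptable. Some TLS setups deliberately use group-readable keys, so adjust the allowed modes to your service. key_mode_ok is a hypothetical helper name.

```shell
# Succeed if a private key is accessible by its owner only.
key_mode_ok() {
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        600|400) return 0 ;;
        *)       echo "$1 has mode $mode; expected 600 or 400" >&2; return 1 ;;
    esac
}
# Check access as the service account itself (user and path are placeholders):
#   sudo -u nginx test -r /etc/nginx/nginx.conf && echo readable
```

Testing as the service user covers every directory on the path as well, which is the same question namei -l answers visually.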

Step 8: Change one thing at a time

Discipline: Change one thing, then retest before changing anything else. If you change multiple things at once, you will not know which fix mattered — or what broke something new.

After each change:

Restart or reload the relevant service
Re-run your test
Check logs again
Note what changed and what effect it had
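One way to keep each change small and recorded is to copy the file aside before editing it and diff afterwards. backup_config is a hypothetical helper name, and the paths in the comments are placeholders.

```shell
# Copy a config file aside before editing it; print the backup path.
backup_config() {
    backup="$1.$(date +%Y%m%dT%H%M%S).bak"
    cp -p -- "$1" "$backup" && echo "$backup"
}
# Typical loop:
#   old=$(backup_config /etc/nginx/nginx.conf)
#   ... make ONE edit ...
#   diff -u "$old" /etc/nginx/nginx.conf    # exactly what changed
#   nginx -t && systemctl reload nginx      # validate, apply, then retest
```

The diff output doubles as the note of what changed, and the backup gives you a one-command rollback if the change made things worse.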

Common fault domains

When stuck, work through this list:

Service down — not started, crashed, or failed on startup
Bad config — syntax error, wrong path, missing directive
Bad permissions — file or directory not accessible to the service
Port blocked — firewall rule, SELinux, wrong bind address
DNS wrong — hostname does not resolve, or resolves to the wrong IP
Cert expired or mismatched — check dates and hostname
Auth broken — wrong credentials, expired ticket, missing key
Time skew — Kerberos will fail if clocks are too far apart
Automation rendered wrong file — a template produced unexpected output
Resource exhaustion — disk full, inode exhaustion, OOM — see below

Resource exhaustion

Services fail silently or in confusing ways when a system runs out of disk space, inodes, memory, or file descriptors. These are easy to miss because the error messages often point elsewhere.

Disk space

df -h      # show disk usage by filesystem (human-readable)
df -i      # show inode usage — a full inode table also stops writes

A filesystem can have free space but exhausted inodes — you will not be able to create new files. Always check both.

Common symptom: A service fails to write logs, start, or create temp files with "No space left on device" — even if df -h shows space available. Check df -i.

Find what is consuming space:

du -sh /*             # top-level usage
du -sh /var/log/*    # check log directories specifically
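Checking both space and inodes can be reduced to one filter. This sketch assumes the POSIX output format of df -P, which GNU df also uses for df -Pi; over_threshold is a hypothetical helper name and 90 is an arbitrary default.

```shell
# Read `df -P` (or `df -Pi`) output on stdin and print every mount point
# whose use percentage is at or above the threshold (default 90).
over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 >= limit) print $6, pct "%"
        }
    '
}
# Typical use:
#   df -P  | over_threshold 90    # blocks
#   df -Pi | over_threshold 90    # inodes
```

Mount points containing spaces would be truncated at the first space, which is acceptable for a quick check.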

OOM (Out of Memory)

dmesg | grep -i oom           # kernel OOM killer log
journalctl -k | grep -i oom   # same via systemd journal

The OOM killer terminates processes when memory is critically low. If a service or process disappeared without explanation, check for OOM kills first.

free -h    # check current memory and swap usage
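To see which process was killed, the kernel message can be parsed. This sketch assumes the common "Killed process 1234 (name)" wording; the exact message differs between kernel versions, so treat a miss as inconclusive. oom_victims is a hypothetical helper name.

```shell
# Read kernel log text on stdin and print "PID NAME" for each OOM kill.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
# Typical use (SERVICE is a placeholder):
#   journalctl -k -b --no-pager | oom_victims
#   systemctl show -p Result --value SERVICE   # "oom-kill" if systemd recorded one
```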

File descriptor limits

ulimit -n                    # your shell's current limit
cat /proc/PID/limits         # limits for a running process (replace PID)

High-traffic services (nginx, databases) can hit the open file descriptor limit under load. When they do, they fail to accept new connections even though the service is technically running. Raise the limit with LimitNOFILE=65535 in the [Service] section of the unit file, then restart the service, because limits are applied when a process starts.
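A sketch for comparing a process's open descriptors against its soft limit, assuming the usual /proc/PID/limits layout. soft_nofile and fd_usage are hypothetical helper names, and nginx in the comments is a placeholder.

```shell
# Read /proc/PID/limits on stdin and print the soft "Max open files" limit.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Print "open/limit" for a running process
# (inspecting another user's process usually needs root).
fd_usage() {
    open=$(ls "/proc/$1/fd" | wc -l)
    limit=$(soft_nofile < "/proc/$1/limits")
    echo "$open/$limit"
}
# A drop-in keeps the packaged unit file untouched:
#   sudo systemctl edit nginx      # add:  [Service]
#                                  #       LimitNOFILE=65535
#   sudo systemctl restart nginx
```

If the first number is close to the second under load, the descriptor limit is a likely cause of refused connections.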

Container namespace hint

If the service runs inside a container (Podman, Docker, Kubernetes) its network, mount and PID view are isolated. curl from the host may succeed while the app inside sees a different world — you have to enter the namespace to debug it.

# List all namespaces on the host (one row per namespace, with the lowest PID inside it)
lsns
lsns -t net        # only the network namespaces

# Enter the network namespace of a running container (by its PID)
sudo nsenter -t <PID> -n ss -tlnp        # what is *actually* listening
sudo nsenter -t <PID> -n ip a            # the container's view of interfaces
sudo nsenter -t <PID> -n curl -v http://127.0.0.1:8080/

# Enter *all* namespaces of that PID, which feels like attaching to the container
sudo nsenter -t <PID> -a
# (roughly what podman exec -it <id> sh gives you, but works when exec is broken;
#  the shell is looked up inside the container, so the image must contain one)

Get the PID with podman inspect --format '{{.State.Pid}}' <ctr>, docker inspect --format '{{.State.Pid}}' <ctr>, or crictl inspect -o json <id> | jq .info.pid. The core pattern is sudo nsenter -t PID -n <command>: swap in the real PID, then run ss, ip, ping, curl, or tcpdump inside the container's own network stack without installing debug tools into the image.
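The core pattern can be wrapped so that a bad PID fails early instead of producing a confusing nsenter error. in_netns is a hypothetical helper name and web is a placeholder container name.

```shell
# Run a host command inside the network namespace of PID.
# Refuses anything that is not a live numeric PID.
in_netns() {
    pid=$1
    case "$pid" in
        ''|*[!0-9]*) echo "usage: in_netns PID command [args...]" >&2; return 2 ;;
    esac
    [ -e "/proc/$pid/ns/net" ] || { echo "no such process: $pid" >&2; return 3; }
    shift
    sudo nsenter -t "$pid" -n "$@"
}
# Typical use:
#   in_netns "$(podman inspect --format '{{.State.Pid}}' web)" ss -tlnp
```

A stopped container reports PID 0, which the /proc check rejects before nsenter is ever called.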

conntrack & tcpdump

When "the service is running and the port is open but connections still fail", look at what the kernel is actually doing to the packets. conntrack exposes the netfilter connection table (so you can see NAT mappings, stuck TIME_WAIT entries, and unanswered SYNs) and tcpdump shows you the bytes on the wire.

# What connections does the kernel currently track for host X?
conntrack -L | grep 192.0.2.10

# Live stream of NEW events — catch handshakes as they happen (or don't)
conntrack -E -e NEW

# Delete every tracked entry with this source address (e.g. to clear a stale NAT mapping)
conntrack -D -s 192.0.2.10

# Classic tcpdump filter: HTTP(S) to/from a specific host on one interface
tcpdump -nn -i eth0 'host 192.0.2.10 and (tcp port 80 or tcp port 443)'

# Packets with the SYN flag set, i.e. SYN and SYN-ACK (is the SYN even reaching us? are we replying?)
tcpdump -nn -i any 'tcp[tcpflags] & (tcp-syn) != 0 and host 192.0.2.10'

# Write a trace you can open in Wireshark later
tcpdump -nn -i eth0 -w /tmp/cap.pcap 'host 192.0.2.10 and port 443'

A typical use: curl times out with no error in the service log. Run conntrack -E -e NEW and reproduce. If no NEW entry appears, the packet either never reached this host or was dropped by the firewall before the connection was recorded (check iptables -nvL / nft list ruleset). If an entry appears but the state stays SYN_SENT, the remote end isn't replying: pivot to tcpdump and the routing/MTU side. See Wireshark for analysing the resulting .pcap.
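To spot a pile of half-open connections quickly, the connection table can be summarised by state. This sketch assumes the usual conntrack -L layout, where TCP entries carry the state in the fourth column; conntrack_states is a hypothetical helper name.

```shell
# Read `conntrack -L -p tcp` output on stdin and print a count per TCP state.
conntrack_states() {
    awk '$1 == "tcp" { count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}
# Typical use:
#   sudo conntrack -L -p tcp 2>/dev/null | conntrack_states
```

A large SYN_SENT count toward one destination points at the remote end or the path to it; a table dominated by ESTABLISHED entries suggests the problem is higher up the stack.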

Jump to the right page

Service-specific playbooks: once the symptom is narrowed to one service, the per-service pages have the relevant -t / check commands, common error patterns, and firewall/SELinux gotchas already written down.

Web / reverse proxy — Nginx, Apache
Mail — Postfix (MTA / queue) and Dovecot (IMAP/POP3)
Forward proxy — Squid
Logging — Rsyslog
Time — Chrony (check this first for Kerberos/TLS weirdness)
