Service-Specific Troubleshooting
- General first steps — always
- nginx not responding
- Postfix mail not sending
- SSH connection failing
- Time sync problems
- Login / authentication failing
- DNS resolution failing
- Blocked by SELinux
- Blocked by firewall
- Disk full / inodes exhausted
- Postfix queue management
- Advanced systemd troubleshooting
- Backend dependency down (DB / cache / queue)
General first steps — always
Before diving into service-specific steps, do these first for any broken service:
# 1. Is the service running?
systemctl status SERVICE_NAME
# 2. What do the logs say?
journalctl -u SERVICE_NAME -n 50
journalctl -u SERVICE_NAME --since "10 minutes ago"
# 3. Is it listening on the right port?
ss -tlnp | grep SERVICE_NAME
# 4. Are there recent errors at the system level?
journalctl -p err -b --no-pager | head -30
# 5. Did something change recently?
git log --oneline -10 # in the ansible repo
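When the same triage is repeated for many services, the first four checks can be bundled into a small function. This is a minimal sketch, not part of the original runbook: the name `first_steps` is hypothetical, and it only prints the commands so they can be reviewed before being piped to a shell. The `git log` check is left out because it depends on where the repo lives.

```bash
# first_steps (hypothetical helper): print the generic triage commands for one
# service. Nothing is executed here; pipe the output to a shell to run it.
first_steps() {
    local svc="$1"
    cat <<EOF
systemctl status --no-pager $svc
journalctl -u $svc -n 50 --no-pager
ss -tlnp | grep $svc
journalctl -p err -b --no-pager | head -30
EOF
}
```

Usage: `first_steps nginx` to review the commands, `first_steps nginx | sh -x` to run them with each command echoed.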
nginx not responding
# Step 1: Is it running?
systemctl status nginx
journalctl -u nginx -n 30
# Step 2: Config syntax error?
nginx -t
# If syntax error, it will point to the file and line
# Step 3: Is it listening?
ss -tlnp | grep nginx
# If not listening: service probably failed to start — read the logs
# Step 4: Is the port open in the firewall?
firewall-cmd --list-all | grep -E "ports|services"
# Step 5: Is SELinux blocking it?
grep "type=AVC" /var/log/audit/audit.log | grep nginx | tail -10
# Step 6: Test from outside
curl -v http://HOSTNAME/
curl -vk https://HOSTNAME/ # -k skips cert check
# Step 7: Check if upstream backend is running (for reverse proxy)
curl -v http://127.0.0.1:8080/ # test backend directly
Common nginx failures:
- 502 Bad Gateway — nginx is up, backend is down or not listening on the expected port
- 504 Gateway Timeout — backend is too slow or hanging; check proxy_read_timeout
- 413 Request Entity Too Large — increase client_max_body_size
- SSL_ERROR_RX_RECORD_TOO_LONG — HTTPS client connecting to HTTP port; check listen and port config
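The status-code entries in the list above can be turned into a quick lookup for scripts or on-call notes. A minimal sketch, assuming the hints from this page are the ones you want; `nginx_hint` is a hypothetical name.

```bash
# nginx_hint (hypothetical helper): map an HTTP status code seen by the client
# to the first thing to check, following the list of common failures above.
nginx_hint() {
    case "$1" in
        502) echo "backend down or not listening on the expected port" ;;
        504) echo "backend slow or hanging; check proxy_read_timeout" ;;
        413) echo "request body too large; raise client_max_body_size" ;;
        *)   echo "no hint for status $1; read the nginx error log" ;;
    esac
}
```

Usage: `nginx_hint "$(curl -s -o /dev/null -w '%{http_code}' http://HOSTNAME/)"`.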
Postfix mail not sending
# Step 1: Is postfix running?
systemctl status postfix
journalctl -u postfix -n 30
# Step 2: What is in the queue?
postqueue -p
# Look at the "stuck" messages — the reason is shown
# Step 3: Try to flush the queue and watch what happens
postqueue -f
journalctl -u postfix -f # watch in another terminal
# Step 4: Test sending manually
echo "test" | mail -s "test" you@example.com
journalctl -u postfix -n 20 # check what happened
Reading queue errors in postqueue -p:
# Connection refused to relayhost
connect to smtp.example.com[10.0.0.2]:25: Connection refused
→ relayhost is down or wrong port in main.cf
# Authentication failure
SASL authentication failed
→ Wrong credentials in sasl_passwd, or sasl_passwd.db not updated (run postmap)
# TLS required but not offered
server requires encryption
→ set smtp_tls_security_level = encrypt (or = may for opportunistic TLS)
# DNS lookup failed
Host or domain name not found. Name service error
→ relayhost hostname does not resolve; add [] brackets to skip MX lookup
# Check DNS resolution of relayhost
dig +short smtp.example.com
# Check TCP connectivity
nc -zv smtp.example.com 587
# Check SASL credentials file
postconf smtp_sasl_password_maps
ls -la /etc/postfix/sasl_passwd.db # must exist and be newer than sasl_passwd
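The "must exist and be newer" rule for sasl_passwd.db can be checked without reading timestamps by eye. A minimal sketch, assuming a map type that produces a `.db` file next to the source (as on this page); `map_is_fresh` is a hypothetical name.

```bash
# map_is_fresh (hypothetical helper): succeed only if SRC.db exists and is not
# older than SRC, i.e. postmap was run after the last edit of SRC.
map_is_fresh() {
    local src="$1"
    if [ ! -f "$src.db" ]; then
        echo "missing: $src.db (run: postmap $src)"
        return 1
    fi
    if [ "$src" -nt "$src.db" ]; then
        echo "stale: $src.db is older than $src (run: postmap $src)"
        return 1
    fi
    echo "fresh: $src.db"
}
```

Usage: `map_is_fresh /etc/postfix/sasl_passwd`.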
SSH connection failing
# Step 1: Test with verbose output
ssh -vvv user@host 2>&1 | head -50
# Look for:
# - "Connecting to host port 22" — network connectivity
# - "Authentications that can continue" — what the server accepts
# - "No more authentication methods to try" — key not accepted
# Step 2: Is sshd running on the target?
systemctl status sshd
# Step 3: Is port 22 open?
ss -tlnp | grep sshd
firewall-cmd --list-all | grep ssh
# Step 4: Key issues
# Check the key is in authorized_keys
grep -F "$(cut -d' ' -f2 ~/.ssh/id_ed25519.pub)" ~/.ssh/authorized_keys # -F: match the key as a literal string
# Check permissions (must be exact)
ls -la ~/.ssh/ # dir: 700
ls -la ~/.ssh/authorized_keys # file: 600
# Step 5: Check SELinux
restorecon -Rv ~/.ssh/ # fix any context issues
# Step 6: Check sshd logs on target
journalctl -u sshd -n 30
Common SSH error messages:
- Connection refused — sshd not running, or wrong port, or firewall blocking
- Connection timed out — network unreachable or firewall silently dropping
- Permission denied (publickey) — key not in authorized_keys, wrong permissions, or key type not accepted
- Host key verification failed — host key changed (or known_hosts is stale); remove the old entry with ssh-keygen -R hostname
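The permission check from Step 4 is easy to script. A minimal sketch, assuming GNU `stat` (the `-c` flag; BSD `stat` differs) and the exact modes given on this page; `check_ssh_perms` is a hypothetical name.

```bash
# check_ssh_perms (hypothetical helper): compare the modes of an .ssh directory
# and its authorized_keys file with the 700 / 600 values from Step 4.
check_ssh_perms() {
    local dir="${1:-$HOME/.ssh}" rc=0 mode
    mode=$(stat -c '%a' "$dir") || return 2
    [ "$mode" = 700 ] || { echo "$dir is $mode, expected 700"; rc=1; }
    if [ -f "$dir/authorized_keys" ]; then
        mode=$(stat -c '%a' "$dir/authorized_keys")
        [ "$mode" = 600 ] || { echo "$dir/authorized_keys is $mode, expected 600"; rc=1; }
    fi
    return "$rc"
}
```

Usage: `check_ssh_perms ~/.ssh` on the target host. Silence and exit status 0 mean both modes match.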
Time sync problems
# Step 1: Is chrony running?
systemctl status chronyd
# Step 2: Is it synced?
chronyc tracking
# Look for "System time" — should be small (milliseconds)
# Look for "Leap status: Normal" — not "Not synchronised"
# Step 3: What sources is it using?
chronyc sources -v
# '*' = currently synced source
# '+' = acceptable source
# '?' = unreachable source
# Step 4: Can it reach the NTP servers?
chronyc sourcestats
ping ntp1.example.com
# Step 5: Force a sync (if clock is far off)
chronyc makestep
# or
chronyc -a makestep
# Step 6: Check the config
grep -E '^(server|pool)' /etc/chrony.conf # sources can be declared with either directive
If all NTP sources show ? (unreachable):
# DNS check
dig +short ntp1.example.com
# Connectivity check (NTP uses UDP port 123; UDP is connectionless, so "succeeded" only means no ICMP error came back)
nc -zuv ntp1.example.com 123
# Firewall check
firewall-cmd --list-all | grep -E "ntp|123"
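The "System time should be small" check from Step 2 can be made pass/fail. A minimal sketch, assuming the usual `chronyc tracking` layout in which the line reads `System time     : <seconds> seconds slow|fast of NTP time`; `chrony_offset_ok` and the 0.1 s default limit are hypothetical choices, not values from this page.

```bash
# chrony_offset_ok (hypothetical helper): read `chronyc tracking` output on
# stdin and succeed if the System time offset is below LIMIT seconds.
chrony_offset_ok() {
    awk -v limit="${1:-0.1}" -F': ' '
        /^System time/ { split($2, f, " "); offset = f[1]; found = 1 }
        END {
            if (!found) { print "no System time line found"; exit 2 }
            printf "offset %s s (limit %s s)\n", offset, limit
            exit ((offset + 0 < limit + 0) ? 0 : 1)
        }'
}
```

Usage: `chronyc tracking | chrony_offset_ok 0.1`.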
Login / authentication failing
# Step 1: Is SSSD running?
systemctl status sssd
# Step 2: Can SSSD resolve the user?
id username@example.com
# Step 3: Test HBAC rules
ipa hbactest --user=username --host=$(hostname) --service=sshd --detail
# Step 4: Check Kerberos
kinit username@EXAMPLE.COM
klist # see if a ticket was issued
# Step 5: Check time sync (Kerberos fails with clock skew > 5 min)
chronyc tracking | grep "System time"
date # compare with date on the IPA server
# Step 6: SSSD logs
tail -f /var/log/sssd/sssd_example.com.log
journalctl -u sssd -n 50
# Step 7: PAM auth logs
journalctl -u sshd -n 20 # sshd pam logs
tail -f /var/log/secure
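The clock comparison in Step 5 can be done with epoch seconds instead of reading two `date` outputs side by side. A minimal sketch using the 5-minute limit mentioned above; `skew_ok` is a hypothetical name, and how you obtain the IPA server's timestamp is left to you.

```bash
# skew_ok (hypothetical helper): compare two epoch timestamps against the
# Kerberos clock-skew tolerance (default 300 s, i.e. 5 minutes).
skew_ok() {
    local local_ts="$1" remote_ts="$2" limit="${3:-300}" diff
    diff=$(( local_ts - remote_ts ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "skew ${diff}s (limit ${limit}s)"
    [ "$diff" -le "$limit" ]
}
```

Usage, assuming SSH access to a host called ipa01.example.com (a placeholder): `skew_ok "$(date +%s)" "$(ssh ipa01.example.com date +%s)"`.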
DNS resolution failing
# Step 1: Basic test
dig example.com
nslookup example.com
# Step 2: Which resolver is being used?
cat /etc/resolv.conf
resolvectl status # on systemd-resolved systems
# Step 3: Test with a specific resolver
dig example.com @8.8.8.8
dig example.com @10.0.0.10 # your internal DNS
# Step 4: Is the resolver reachable?
nc -zuv 10.0.0.10 53 # UDP
nc -zv 10.0.0.10 53 # TCP (used for large responses)
# Step 5: Check /etc/hosts for overrides
grep example.com /etc/hosts
# Step 6: Check nsswitch.conf
grep hosts /etc/nsswitch.conf # should be: files dns
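Step 6 can be scripted so the lookup order is printed one source per line, which makes a missing `dns` entry obvious. A minimal sketch; `hosts_sources` is a hypothetical name, and bracketed action items such as `[NOTFOUND=return]` are skipped rather than interpreted.

```bash
# hosts_sources (hypothetical helper): print the sources on the hosts: line of
# an nsswitch.conf-style file in lookup order, ignoring [ACTION] items.
hosts_sources() {
    awk '$1 == "hosts:" {
        for (i = 2; i <= NF; i++) if ($i !~ /^\[/) print $i
    }' "${1:-/etc/nsswitch.conf}"
}
```

Usage: `hosts_sources | grep -qx dns || echo "dns is not in the hosts lookup order"`.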
Blocked by SELinux
# Step 1: Is SELinux in enforcing mode?
getenforce
# Step 2: Check for recent denials
ausearch -m avc -ts recent | tail -20
grep "type=AVC" /var/log/audit/audit.log | tail -10
# Step 3: Explain the denial
ausearch -m avc -ts recent | audit2why
# Step 4: Quick test — switch to permissive temporarily
setenforce 0
# retry the operation
# if it works in permissive, SELinux is the cause
setenforce 1
# Step 5: Fix it properly
# Check for a boolean that covers this use case:
getsebool -a | grep relevant_keyword
setsebool -P boolean_name on
# Or fix a file context:
semanage fcontext -a -t correct_type_t "/path/to/files(/.*)?"
restorecon -Rv /path/to/files/
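Raw AVC lines are long, and the same denial usually repeats many times. A minimal sketch that reduces them to "process, source type, target type, class" with a count; `avc_summary` is a hypothetical name, and it assumes the standard `comm=`, `scontext=`, `tcontext=`, `tclass=` fields are present on each line.

```bash
# avc_summary (hypothetical helper): read raw AVC lines on stdin and print
# "count comm source_type -> target_type (class)", most frequent first.
avc_summary() {
    sed -n 's/.*comm="\([^"]*\)".* scontext=[^:]*:[^:]*:\([^:]*\):.* tcontext=[^:]*:[^:]*:\([^:]*\):.* tclass=\([a-z_0-9]*\).*/\1 \2 -> \3 (\4)/p' |
        sort | uniq -c | sort -rn
}
```

Usage: `grep "type=AVC" /var/log/audit/audit.log | avc_summary`. The type names in the output are what you would look for in `getsebool -a` or pass to `semanage fcontext`.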
Blocked by firewall
# Step 1: Check what is allowed
firewall-cmd --list-all
# Step 2: Test from the client side
nc -zv targethost port
curl -v http://targethost:port
# Step 3: Verify traffic is reaching the server at all
tcpdump -i eth0 port PORT # run on the server; check if packets arrive
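# If packets arrive and the server replies with a TCP reset (RST):
# → nothing is listening on that port or interface (not a firewall issue)
# If packets arrive but get no reply, or an ICMP "prohibited" reply:
# → the firewall on this host is dropping or rejecting them
#   (tcpdump sees inbound packets before the local firewall does)
# If no packets arrive:
# → something upstream is blocking (network firewall, security group, routing)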
# Step 4: Add the rule
firewall-cmd --permanent --add-port=PORT/tcp
firewall-cmd --reload
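Before adding a rule in Step 4, it helps to confirm the port is not already covered by an existing entry or range. A minimal sketch that reads the space-separated output of `firewall-cmd --list-ports`; `port_allowed` is a hypothetical name, and it looks only at explicit ports and port ranges, not at ports opened through a named service.

```bash
# port_allowed (hypothetical helper): read `firewall-cmd --list-ports` output
# on stdin and succeed if PORT/PROTO is listed directly or inside a range.
port_allowed() {
    local want_port="${1%/*}" want_proto="${1#*/}"
    tr ' ' '\n' | awk -F'[-/]' -v p="$want_port" -v proto="$want_proto" '
        NF == 2 && $2 == proto && $1 + 0 == p + 0 { ok = 1 }
        NF == 3 && $3 == proto && p + 0 >= $1 + 0 && p + 0 <= $2 + 0 { ok = 1 }
        END { exit (ok ? 0 : 1) }'
}
```

Usage: `firewall-cmd --list-ports | port_allowed 8080/tcp || echo "8080/tcp is not open as a port"`.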
Disk full / inodes exhausted
# Step 1: Check disk space AND inodes
df -h # blocks
df -i # inodes (a full inode table looks like free space but still fails)
# Step 2: Find what is using space — stay on the same filesystem with -x
du -xhd1 /var | sort -rh | head # top dirs in /var only, don't cross mounts
du -xhd1 / | sort -rh | head # top dirs at root level
du -sh /var/log/* | sort -rh | head
du -sh /var/spool/* | sort -rh | head
# Large individual files
find /var/log -xdev -size +100M -printf '%s\t%p\n' | sort -rn | head
# Inodes exhausted but df -h shows space free? Find dirs with many tiny files
find / -xdev -type d -print 2>/dev/null | while IFS= read -r d; do echo "$(ls -A "$d" 2>/dev/null | wc -l) $d"; done | sort -rn | head # IFS= and -r keep spaces and backslashes in paths intact
# Step 3: Check mail queue size
postqueue -p | wc -l
ls /var/spool/postfix/deferred/ | wc -l
# Step 4: Rotate or truncate logs safely
journalctl --vacuum-size=2G
logrotate -f /etc/logrotate.conf
# Step 5: Truncate a large log that a service still has open
> /var/log/some.log # truncates in place — preserves inode and permissions
# DO NOT rm a log file that a running service has open — the blocks stay allocated
# until the service closes its file descriptor. Use lsof +L1 to find such cases:
lsof +L1 # files that are deleted but still open — space still in use
# Step 6: If the filesystem is on LVM, extend it online instead of cleaning up
sudo lvextend -r -L +5G /dev/vg_data/lv_var # -r resizes the filesystem too
Check both df -h and df -i early: inode exhaustion looks like "plenty of space" to most metrics but still returns ENOSPC. For the LVM extend recipe and how to tell if a volume can grow online, see LVM — Logical Volume Manager.
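Because blocks and inodes have to be checked separately, one filter that works on both outputs is convenient. A minimal sketch, assuming POSIX-format output (`df -P` and `df -Pi`), where the fifth column is the percentage used and the sixth is the mount point; `over_threshold` is a hypothetical name, and mount points containing spaces are not handled.

```bash
# over_threshold (hypothetical helper): read `df -P` or `df -Pi` output on
# stdin and print "mountpoint use%" for entries at or above PCT percent.
over_threshold() {
    awk -v pct="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= pct + 0) print $6, $5
    }'
}
```

Usage: `df -P | over_threshold 90; df -Pi | over_threshold 90`.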
Postfix queue management
# Show the queue (deferred, active, hold)
postqueue -p
mailq # same, shorthand
# Count queued messages
postqueue -p | grep -c "^[0-9A-F]"
# Flush: attempt to deliver all deferred messages now
postqueue -f
# Delete a specific message by queue ID
postsuper -d QUEUEID
# Delete all deferred messages (use with care)
postsuper -d ALL deferred
# Inspect a specific message including headers
postcat -q QUEUEID
# Delete all messages in queue (emergency only)
postsuper -d ALL
postsuper -d ALL deferred only deletes messages stuck in the deferred queue — messages being actively delivered are unaffected. Use this when a large backlog of undeliverable messages is consuming disk space.
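Before deleting a backlog, it is worth knowing why the messages are deferred. A minimal sketch that groups the parenthesised reason lines from `postqueue -p` and counts them; `queue_reasons` is a hypothetical name, and it assumes each reason sits on a single line wrapped in parentheses.

```bash
# queue_reasons (hypothetical helper): read `postqueue -p` output on stdin and
# print "count reason" for each distinct deferral reason, most frequent first.
queue_reasons() {
    sed -n 's/^[[:space:]]*(\(.*\))[[:space:]]*$/\1/p' |
        sort | uniq -c | sort -rn
}
```

Usage: `postqueue -p | queue_reasons | head`. One dominant reason (for example a refused relayhost connection) usually means fixing the cause and flushing is better than deleting.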
Advanced systemd troubleshooting
# See all failed units
systemctl list-units --failed
# Find which service is slow at boot
systemd-analyze blame
# See the full critical chain for boot time
systemd-analyze critical-chain
# Show full unit properties (all settings, including computed defaults)
systemctl show nginx
# Show the dependency tree of a unit
systemctl list-dependencies nginx
systemctl list-dependencies nginx --reverse # who depends ON nginx
systemctl show nginx outputs every key=value pair for the unit — useful when a setting from a drop-in is not being picked up, or you want to confirm the actual Restart= or ExecStart= value that systemd is using (not just what the file says).
Backend dependency down (DB / cache / queue)
Your app is returning 500s, the app logs say "cannot connect to database" (or Redis, or RabbitMQ, or Kafka). Before touching application config, you need to decide which layer is actually broken. Nine times out of ten the answer is one of four: the backend is down, it is up but unreachable from this host, it is reachable but rejecting auth, or it is authenticating but slow. Work the flow below top-down; each step rules out an entire class of cause.
| Question | How to answer it | If NO, fix here |
|---|---|---|
| 1. Is the backend up? | On the backend host: systemctl status postgresql / redis / rabbitmq-server. Managed service? Check the provider's status page / dashboard. | Restart the service, follow its own troubleshooting runbook, escalate to the DB/cache team. |
| 2. Is it reachable from the app host? | From the app host: nc -zv db01.internal 5432, redis-cli -h cache01 PING, curl -s rmq01:15672/api/overview. Pair with dig +short on the backend's hostname to catch DNS issues. | Firewall / security group / Docker network / VPN path: the problem is the network layer, not the backend. See firewalld and nmap. |
| 3. Is the app authenticating? | Connect manually with the app's own credentials: psql -h db01 -U app_user app_db, redis-cli -h cache01 -a "$APP_PASS", rabbitmqctl authenticate_user app_user "$APP_PASS". Watch for "authentication failed", "WRONGPASS", or TLS errors. | Rotate / re-sync the secret, check the vault/KMS entry matches the backend's user table, verify TLS cert chains if the backend requires TLS auth. |
| 4. Is it responsive, not just reachable? | Time a trivial query: time psql -c 'SELECT 1', redis-cli -h cache01 --latency, rabbitmqctl list_queues name messages consumers. Check saturation metrics: CPU, connection count, queue depth, replication lag. | Saturation / lock contention / replication lag: the service is "up" but cannot accept new work in time. Scale, kill the offending session, or fail over to a replica. |
Concrete commands for the three most common dependencies:
# PostgreSQL
nc -zv db01.internal 5432 # reachable?
psql "host=db01 user=app_user dbname=app sslmode=require" -c 'SELECT 1'
psql -h db01 -U postgres -c "SELECT count(*) FROM pg_stat_activity;" # connection saturation
psql -h db01 -U postgres -c "SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity WHERE state <> 'idle';"
# (see /postgres-ops/ for full diagnosis flow)
# Redis
nc -zv cache01 6379
redis-cli -h cache01 PING # expect PONG
redis-cli -h cache01 -a "$APP_PASS" INFO clients # connected_clients, blocked_clients
redis-cli -h cache01 --latency -i 1 # live latency samples
redis-cli -h cache01 SLOWLOG GET 10 # last 10 slow commands
# RabbitMQ
nc -zv rmq01 5672
curl -fsS -u app_user:"$APP_PASS" http://rmq01:15672/api/aliveness-test/%2F
rabbitmqctl list_queues name messages consumers messages_unacknowledged
rabbitmqctl list_connections user state channels recv_oct send_oct # any stuck connections?
Always climb the ladder in order — up → reachable → authenticated → responsive. Jumping straight to "the password must be wrong" or "it's a slow query" without first proving the preceding layers is how thirty-minute outages turn into three-hour ones. If step 1 fails, nothing in steps 2–4 matters yet.
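The "climb in order, stop at the first failure" rule can be enforced by a small runner. A minimal sketch, not a definitive tool: `check_ladder` is a hypothetical name, each argument is a `label=command` pair, and the commands in the usage comment reuse the placeholder hosts from this section.

```bash
# check_ladder (hypothetical helper): run "label=command" checks in the order
# given and stop at the first one that fails, so later layers are never
# blamed before earlier ones are proven.
check_ladder() {
    local step label cmd
    for step in "$@"; do
        label="${step%%=*}"   # text before the first "="
        cmd="${step#*=}"      # everything after the first "="
        if sh -c "$cmd" >/dev/null 2>&1; then
            echo "ok: $label"
        else
            echo "FAILED: $label"
            return 1
        fi
    done
}

# Example, run from the app host (hosts and credentials are placeholders):
# check_ladder \
#   "reachable=nc -z db01.internal 5432" \
#   "authenticated=psql -h db01 -U app_user app_db -c 'SELECT 1'"
```

The first `FAILED:` line names the layer to fix. Step 1 (is the backend up?) is normally checked on the backend host itself, so it is not part of the example.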